xcit'ed

- paper management system


by matton

Tag search results for 'information'

G. Salton, C.S. Yang, and C.T. Yu
A Theory of Term Importance in Automatic Text Analysis
updated at: 2007/06/14 10:56:34
Young Mee Chung and Jae Yun Lee.
A corpus-based approach to comparative evaluation of statistical term association measures.
Journal of the American Society for Information Science and Technology.
volume 52, issue 4, pages 283--296,
2001.
Statistical association measures have been widely applied in information retrieval research, usually employing a clustering of documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationship and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of a term frequency on the association values by means of z-score. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas cosine and Jaccard coefficients, as well as \chi^2 statistic and likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the \chi^2 statistic is the least affected by the frequency of terms. Third, although cosine and Jaccard coefficients tend to emphasize high frequency terms, mutual information and Yule's Y seem to overestimate rare terms.
updated at: 2007/06/12 22:02:28
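The six association measures named in the Chung and Lee abstract above can all be computed from a 2x2 co-occurrence table. The following is a minimal sketch using common textbook definitions, which may differ in detail from the exact formulations used in the paper. The function name and the cell names a, b, c, d are assumptions made for illustration: a counts contexts containing both terms, b and c contexts containing only the first or only the second term, and d contexts containing neither.

```python
import math


def association_measures(a, b, c, d):
    """Association scores for two terms from a 2x2 contingency table.

    Assumes every cell is positive; zero cells would need smoothing,
    which this sketch does not attempt.
    """
    n = a + b + c + d
    fx, fy = a + b, a + c  # marginal frequencies of the two terms
    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    observed = [a, b, c, d]
    # Expected cell counts under independence of the two terms.
    expected = [
        fx * fy / n,
        fx * (b + d) / n,
        (c + d) * fy / n,
        (c + d) * (b + d) / n,
    ]
    return {
        "cosine": a / math.sqrt(fx * fy),
        "jaccard": a / (a + b + c),
        "mutual_information": math.log2(a * n / (fx * fy)),
        "yule_y": (root_ad - root_bc) / (root_ad + root_bc),
        "chi_square": n * (a * d - b * c) ** 2
        / (fx * (c + d) * fy * (b + d)),
        "log_likelihood_ratio": 2
        * sum(o * math.log(o / e) for o, e in zip(observed, expected)),
    }


scores = association_measures(10, 10, 10, 70)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Computing all measures from the same table makes it easy to compare how each one reacts when the marginal frequencies grow or shrink, which is the kind of frequency effect the abstract discusses.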
Carolyn J. Crouch and Bokyung Yang
Experiments in Automatic Statistical Thesaurus Construction
Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval.
pp. 77--88
1992
A well constructed thesaurus has long been recognized as a valuable tool in the effective operation of an information retrieval system. This paper reports the results of experiments designed to determine the validity of an approach to the automatic construction of global thesauri (described originally by Crouch in [1] and [2]) based on a clustering of the document collection. The authors validate the approach by showing that the use of thesauri generated by this method results in substantial improvements in retrieval effectiveness in four test collections. The term discrimination value theory, used in the thesaurus generation algorithm to determine a term's membership in a particular thesaurus class, is found not to be useful in distinguishing between thesaurus classes (i.e., in differentiating a "good" from an "indifferent" or "poor" thesaurus class). In conclusion, the authors suggest an alternate approach to automatic thesaurus construction which greatly simplifies the work of producing viable thesaurus classes. Experimental results show that the alternate approach described herein in some cases produces thesauri which are comparable in retrieval effectiveness to those produced by the first method at much lower cost.
The discrimination value of a term is defined as a
measure of the change in space separation which occurs
when a given term is assigned to the document collection.
A good discriminator is a term which, when assigned to a
document, decreases the space density (rendering the
documents less similar to each other). A poor
discriminator, then, increases space density. By computing
the density of the document space before and after the
assignment of each term, the discrimination value of the
term can be determined.


Empirical results have shown that document frequency and
discrimination value are well correlated.
updated at: 2007/06/12 15:36:38
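The discrimination value described in the notes of the Crouch and Yang entry above can be sketched as a before/after density comparison. This is a minimal illustration, not the paper's implementation: it assumes that documents are equal-length term-weight vectors, that space density is the average pairwise cosine similarity (a centroid-based density is a common alternative), and that "assigning" a term is simulated by comparing the space with and without that term's dimension. All function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length weight vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def space_density(docs):
    """Average pairwise similarity; needs at least two documents."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def discrimination_value(docs, term):
    """Density without the term minus density with it.

    Positive: assigning the term lowers density (good discriminator).
    Negative: assigning the term raises density (poor discriminator).
    """
    without = [[w for j, w in enumerate(doc) if j != term] for doc in docs]
    return space_density(without) - space_density(docs)


# Term 0 occurs in both documents; terms 1 and 2 occur in one each.
docs = [[1, 1, 0], [1, 0, 1]]
for term in range(3):
    print(term, round(discrimination_value(docs, term), 4))
```

In this toy collection the term shared by every document comes out as a poor discriminator, which is consistent with the note that document frequency and discrimination value are correlated.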
Pablo Gamallo, Caroline Gasperin, Alexandre Agustini, and Gabriel P. Lopes
Syntactic-Based Methods for Measuring Word Similarity
MAUTNER V., MOUCEK R., MOUCEK K., Eds., Text, Speech, and Dialogue (TSD-2001),
p. 116--125,
Berlin:Springer Verlag, 2001.
Abstract. This paper explores different strategies for extracting similarity relations between words from partially parsed text corpora. The strategies we have analysed require neither supervised training nor semantic information available from general lexical resources. They differ in the amount and the quality of the syntactic contexts against which words are compared. The paper presents in detail the notion of syntactic context and how syntactic information could be used to extract semantic regularities of word sequences. Finally, experimental tests with a Portuguese corpus demonstrate that similarity measures based on fine-grained and elaborate syntactic contexts perform better than those based on poorly defined contexts.
updated at: 2007/06/12 11:04:03
Vasileios Hatzivassiloglou and Kathleen McKeown
Toward the Automatic Identification of Adjectival Scales: Clustering Adjectives According to Meaning
In Proceedings of the 31st Annual Meeting of the ACL, pages 172--182, Columbus, Ohio.
1993.
In this paper we present a method to group adjectives according to their meaning, as a first step towards the automatic identification of adjectival scales. We discuss the properties of adjectival scales and of groups of semantically related adjectives and how they imply sources of linguistic knowledge in text corpora. We describe how our system exploits this linguistic knowledge to compute a measure of similarity between two adjectives, using statistical techniques and without having access to any semantic information about the adjectives. We also show how a clustering algorithm can use these similarities to produce the groups of adjectives, and we present results produced by our system for a sample set of adjectives. We conclude by presenting evaluation methods for the task at hand, and analyzing the significance of the results obtained.
Semantic relatedness subsumes
hyponymy, synonymy, and antonymy-incompatibility.
automatically grouping adjectives according
to their meaning
adjective - nouns
adjectives describing the same property tend to modify approximately the same set of nouns.
updated at: 2007/06/12 09:25:31
Tokunaga Takenobu, Iwayama Makoto, and Tanaka Hozumi.
Automatic thesaurus construction based on grammatical relations.
In Proceedings of IJCAI-95,
1995.
We propose a method to build thesauri on the basis of grammatical relations. The proposed method constructs thesauri by using a hierarchical clustering algorithm. An important point in this paper is the claim that thesauri in order to be efficient need to take (surface) case information into account. We refer to the thesauri as "relation-based thesaurus (RBT)." In the experiment, four RBTs of Japanese nouns were constructed from 26,023 verb-noun co-occurrences, and each RBT was evaluated by objective criteria. The experiment has shown that the RBTs have better properties for selectional restriction of case frames than conventional ones.
built separate thesauri based on the Japanese surface case
updated at: 2007/06/11 15:50:31
Kenneth Ward Church, Patrick Hanks
Word Association Norms, Mutual Information, and Lexicography
Computational Linguistics 16(1): 22--29.
1990.
The term word association is used in a very particular sense in the psycholinguistic literature. (Generally speaking, subjects respond quicker than normal to the word "nurse" if it follows a highly associated word such as "doctor.") We will extend the term to provide the basis for a statistical description of a variety of interesting linguistic phenomena, ranging from semantic relations of the doctor/nurse type (content word/content word) to lexico-syntactic co-occurrence constraints between
Smaller window sizes will identify fixed expressions (idioms) and other relations that hold over short ranges; larger window sizes will highlight semantic concepts and other relationships that hold over larger scales.

(asymmetry) f(x, y) \neq f(y, x) because f(x, y) encodes linear precedence.
updated at: 2007/06/11 12:30:23
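The window and asymmetry notes in the Church and Hanks entry above can be illustrated with a small sketch. It assumes that f(x, y) counts how often y appears within a fixed number of tokens after x, and that the association score is the base-2 log of the ratio between the joint probability and the product of the marginal probabilities, each estimated by dividing counts by the corpus length. The function names and the exact window convention (here, the number of following tokens inspected) are assumptions for illustration.

```python
import math
from collections import Counter


def ordered_pair_counts(tokens, window):
    """Count f(x, y): how often y occurs within `window` tokens AFTER x.

    Because only following tokens are inspected, f(x, y) and f(y, x)
    generally differ, which encodes linear precedence.
    """
    counts = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            counts[(x, y)] += 1
    return counts


def association_ratio(tokens, x, y, window):
    """log2 of P(x, y) / (P(x) P(y)); None if x is never followed by y."""
    n = len(tokens)
    unigrams = Counter(tokens)
    f_xy = ordered_pair_counts(tokens, window)[(x, y)]
    if f_xy == 0:
        return None
    return math.log2(f_xy * n / (unigrams[x] * unigrams[y]))


tokens = "the doctor called the nurse".split()
print(association_ratio(tokens, "doctor", "nurse", window=3))
print(association_ratio(tokens, "nurse", "doctor", window=3))
```

Shrinking `window` restricts the counts to near-adjacent pairs such as fixed expressions, while enlarging it admits looser, more topical associations, in line with the note on window sizes.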
Yufeng Jing and W. Bruce Croft
An Association Thesaurus for Information Retrieval
Proc. of RIAO (Recherche d'Informations Assistée par Ordinateur) 146--160
1994
Although commonly used in both commercial and experimental information retrieval systems, thesauri have not demonstrated consistent benefits for retrieval performance, and it is difficult to construct a thesaurus automatically for large text databases. In this paper, an approach, called PhraseFinder, is proposed to construct collection-dependent association thesauri automatically using large full-text document collections. The association thesaurus can be accessed through natural language queries in INQUERY, an information retrieval system based on the probabilistic inference network. Experiments are conducted in INQUERY to evaluate different types of association thesauri, and thesauri constructed for a variety of collections.
Many questions remain in using association thesauri to do query expansion.
updated at: 2007/06/11 11:52:09
James R. Curran and Marc Moens.
Scaling Context Space.
In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL),
pp. 231-238,
2002.
Abstract: Context is used in many NLP systems as an indicator of a term's syntactic and semantic function. The accuracy of the system is dependent on the quality and quantity of contextual information available to describe each term. However, the quantity variable is no longer fixed by limited corpus resources. Given fixed training time and computational resources, it makes sense for systems to invest time in extracting high quality contextual information from a fixed corpus. However, with an effectively limitless quantity of text available, extraction rate and representation size need to be considered. We use thesaurus extraction with a range of context extracting tools to demonstrate the interaction between context quantity, time and size on a corpus of 300 million words.
corpus size is no longer a limiting factor
W(L1R1), W(L12) give reasonable results
log-linear relation between corpus size and performance
"It is a phenomenon common to many NLP tasks that the quality or accuracy of a system increases loglinearly with the size of the corpus."
"it could well be that far simpler but scalable learning algorithms significantly outperform existing systems."

used a 300M-word corpus! (cf. WordBank = 3.5M)
up to now people have typically worked with corpora of around one million words (up to one billion!)
thesaurus extraction is a task where success has been limited when using small corpora
updated at: 2007/05/13 09:53:53
Takeshi Sakaki, Yutaka Matsuo, Koki Uchiyama, and Mitsuru Ishizuka.
Construction of a thesaurus of related terms using information on the Web.
Journal of Natural Language Processing (自然言語処理), Vol. 14, Number 2, pp. 3-31,
2007.

This paper describes a method to construct related-term thesauri automatically based on Web information. We use a Web search engine to obtain word co-occurrence information and propose a new, efficient similarity metric that applies the \chi^2 value to solve problems of the existing methods. We also introduce a new method to identify related terms using word clustering: we cluster the words of the resulting associative network using a recent clustering method, the "Newman method". We evaluate our approach and show its effectiveness using sets of related terms extracted from a corpus and a current thesaurus.
updated at: 2007/05/07 18:03:29
Chris Ding and Hanchuan Peng.
Minimum Redundancy Feature Selection from Microarray Gene Expression Data,
Proceedings of the IEEE Computer Society Conference on Bioinformatics,
pp. 523-528, 2003.
Motivation. Selecting a small subset out of the thousands of genes in microarray data is important for accurate classification of phenotypes. Widely used methods typically rank genes according to their differential expressions among phenotypes and pick the top-ranked genes. We observe that feature sets so obtained have certain redundancy and study methods to minimize it. Results. We propose a minimum redundancy – maximum relevance (MRMR) feature selection framework. Genes selected via MRMR provide a more balanced coverage of the space and capture broader characteristics of phenotypes. They lead to significantly improved class predictions in extensive experiments on 5 gene expression data sets: NCI, Lymphoma, Lung, Leukemia and Colon. Improvements are observed consistently among 4 classification methods: Naïve Bayes, Linear discriminant analysis, Logistic regression and Support vector machines. Supplementary: The top 60 MRMR genes for each of the datasets are listed at http://www.nersc.gov/~cding/MRMR/
updated at: 2007/03/06 13:30:45
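The minimum redundancy, maximum relevance idea in the Ding and Peng abstract above can be sketched as a greedy selection over discrete features. This is an illustrative sketch under stated assumptions rather than the authors' implementation: relevance is taken to be the mutual information between a feature and the class labels, redundancy is the mean mutual information with the features already selected, and the two are combined by subtraction (one common way to trade them off). The function names and the toy gene names are hypothetical.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedily pick k feature names, trading relevance against redundancy."""
    relevance = {
        name: mutual_information(values, labels)
        for name, values in features.items()
    }
    selected = []
    remaining = set(features)

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        # Sorting makes tie-breaking deterministic (alphabetical order).
        best = max(sorted(remaining), key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy data: g2 duplicates g1, so it adds no new information once g1 is chosen.
features = {
    "g1": [0, 0, 1, 1],
    "g2": [0, 0, 1, 1],
    "g3": [0, 1, 0, 1],
}
labels = [0, 1, 2, 3]
print(mrmr_select(features, labels, k=2))
```

Ranking by relevance alone would treat g1 and g2 as equally good picks, whereas the redundancy term steers the second choice toward the feature that covers a different part of the label space.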
[authors' names lost to character-encoding damage]
A support tool for building a synonym dictionary using [...] information
IPSJ SIG Technical Report, NL176, pp. 87-94,
2006.
To improve the performance of text processing tasks such as information retrieval or text mining, it is necessary to construct a synonym dictionary, but building one by hand is very tedious. In some fields, such as aviation, synonymous nouns are written in a mix of kanji/hiragana, katakana, alphabetic characters, and abbreviations. Because new words continually come into use, keeping the dictionary up to date is a major issue. In this paper, we propose a tool for constructing a synonym dictionary. The system returns synonym candidates for a query, and a synonym can be registered in the dictionary easily by reviewing the candidates. We tested the system on aviation pilot reports and evaluated it by average precision.
"frequency is sometimes adjusted as log(x_i + 1)" -> effective
window[2,2] was the best
spiral construction -> not better
updated at: 2007/01/23 10:01:54
Donald Hindle.
Noun classification from predicate-argument structures.
In 28th Annual Meeting of the Association for Computational Linguistics,
pp. 268-275,
1990.
Abstract: A method of determining the similarity of nouns on the basis of a metric derived from the distribution of subject, verb and object in a large text corpus is described. The resulting quasi-semantic classification of nouns demonstrates the plausibility of the distributional hypothesis, and has potential application to a variety of tasks, including automatic indexing, resolving nominal compounds, and determining the scope of modification.
"the meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities." (Harris 1968:12).
"More is to be learned from the fact that you can drink wine than from the fact that you can drink it even though there are more clauses in our sample with it as an object of drink than with wine."

"We can define "reciprocally most similar" nouns or "reciprocal nearest neighbors" (RNN) as two nouns which are each other's most similar noun."
updated at: 2007/01/22 16:19:20
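The "reciprocal nearest neighbors" quoted in the Hindle entry above can be found with a few lines once pairwise similarities are available. This is a minimal sketch assuming the similarities are given as a nested dictionary mapping each noun to its scores against the other nouns. The function name and the toy scores are made up for illustration and are not taken from the paper.

```python
def reciprocal_nearest_neighbors(similarity):
    """Return pairs of words that are each other's most similar word.

    `similarity[a][b]` is the similarity of a to b. Ties are broken
    alphabetically so that the result is deterministic.
    """
    nearest = {
        word: max(sorted(others), key=others.get)
        for word, others in similarity.items()
    }
    return sorted(
        (a, b) for a, b in nearest.items() if a < b and nearest.get(b) == a
    )


# Hypothetical similarity scores, for illustration only.
similarity = {
    "boat": {"ship": 0.9, "car": 0.3},
    "ship": {"boat": 0.9, "car": 0.4},
    "car": {"boat": 0.3, "ship": 0.4},
}
print(reciprocal_nearest_neighbors(similarity))
```

Here "car" has "ship" as its nearest neighbor, but the relation is not returned because "ship" prefers "boat"; only mutual best matches qualify.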
Yufeng Jing and W. Bruce Croft.
An association thesaurus for information retrieval.
Proc. of RIAO (Recherche d'Informations Assistée par Ordinateur) '94,
pp. 146-160,
1994.
updated at: 2007/01/22 15:34:09
James Curran.
From Distributional to Semantic Similarity.
PhD thesis, University of Edinburgh,
2004.

<<BOOKMARK>> read only chapter 3.

Landauer and Dumais (1997) -> argue that a 500 "character" limit is more appropriate.
"a fixed character window will select either fewer longer (and thus more informative) words or more shorter (and thus less informative) words, extracting a consistent amount of contextual information for each headword"
updated at: 2007/01/20 17:36:44
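The fixed character window mentioned in the notes on Curran's thesis above can be sketched as follows. The sketch assumes a separate character budget on each side of the headword, counts only the characters of the words themselves (not the spaces between them), and stops at the first word that no longer fits. The function name and these conventions are assumptions for illustration.

```python
def character_window_context(tokens, index, char_limit):
    """Collect context words around tokens[index] within a character budget.

    Words are taken nearest-first, first to the left and then to the
    right, each side with its own budget of `char_limit` characters.
    """
    context = []
    for step in (-1, 1):
        budget = char_limit
        i = index + step
        while 0 <= i < len(tokens) and len(tokens[i]) <= budget:
            context.append(tokens[i])
            budget -= len(tokens[i])
            i += step
    return context


tokens = ["a", "big", "international", "cat", "sat", "on", "it"]
print(character_window_context(tokens, tokens.index("cat"), char_limit=5))
```

With a small budget, a single long word on one side can exhaust it while several short words fit on the other, which is the trade-off the quoted passage describes.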
Gerda Ruge.
Automatic detection of thesaurus relations for information retrieval applications.
In Foundations of Computer Science: Potential - Theory - Cognition, Lecture Notes in Computer Science, volume LNCS 1337,
pp. 499--506,
Springer Verlag, Berlin, Germany,
1997.
Abstract. Is it possible to discover semantic term relations useful for thesauri without any semantic information? Yes, it is. A recent approach for automatic thesaurus construction is based on explicit linguistic knowledge, i.e. a domain independent parser without any semantic component and implicit linguistic knowledge contained in large amounts of real world texts. Such texts include implicitly the linguistic, especially semantic knowledge that the authors needed for formulating their texts. This article explains how implicit semantic knowledge can be transformed to an explicit one. Evaluations of quality and performance of the approach are very encouraging.
'The terms are the searchable items of the system'
'The concept semantically similar subsumes all these thesaurus relations' -> synonymy, hyperonyms, hyponyms, ...
"synonymy" in its strong sense <-> semantically similar
Hearst's method -> 'leads to hyponyms that are not directly related in the hierarchy like "species" and "steatornis oilbird" or
questionable hyponyms like "target" and "airplane".'
"semantically similar terms have similar definitions in a lexicon."
"terms having many heads and modifiers in common are semantically similar"
updated at: 2007/01/20 17:35:46
Dekang Lin.
Automatic retrieval and clustering of similar words.
In Proceedings of the 17th International Conference on Computational Linguistics and of the 36th Annual Meeting of the Association for Computational Linguistics,
pp. 768-774,
1998.
Abstract: Bootstrapping semantics from text is one of the greatest challenges in natural language learning. We first define a word similarity measure based on the distributional pattern of words. The similarity measure allows us to construct a thesaurus using a parsed corpus. We then present a new evaluation methodology for the automatically constructed thesaurus. The evaluation results show that the thesaurus is significantly closer to WordNet than Roget Thesaurus is.
"It was shown in (Dagan et al., 1997) that a similarity-based smoothing
method achieved much better results than backoff smoothing methods in
word sense disambiguation."

"The differences between Hindle and Hindle_r clearly demonstrate that
the use of other types of dependencies in addition to subject and
object relationships is very beneficial."
updated at: 2007/01/20 17:33:48