xcit'ed

- paper management system

by matton

tag search results for 'space'

James R. Curran and Marc Moens.
Improvements in Automatic Thesaurus Extraction.
In Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX),
pp. 59-66,
2002.
Abstract: The use of semantic resources is common in modern NLP systems, but methods to extract lexical semantics have only recently begun to perform well enough for practical use. We evaluate existing and new similarity metrics for thesaurus extraction, and experiment with the tradeoff between extraction performance and efficiency. We propose an approximation algorithm, based on canonical attributes and coarse- and fine-grained matching, that reduces the time complexity and execution time of thesaurus extraction with only a marginal performance penalty.
thesaurus extraction systems -> differ in the definition of "context"
used a statistical shallow parser
frequency cutoff speeds up the calculation, but doesn't decrease the performance
misc. topics: weights, measures, cutoff frequency, speed-up by canonical vectors

canonical vectors: subj+dobj+iobj, TTestLog + maximum frequency cutoff
updated at: 2007/07/07 17:25:42
G. Salton, C.S. Yang, and C.T. Yu
A Theory of Term Importance in Automatic Text Analysis
updated at: 2007/06/14 10:56:34
Carolyn J. Crouch and Bokyung Yang
Experiments in Automatic Statistical Thesaurus Construction
Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval.
pp. 77--88
1992
A well constructed thesaurus has long been recognized as a valuable tool in the effective operation of an information retrieval system. This paper reports the results of experiments designed to determine the validity of an approach to the automatic construction of global thesauri (described originally by Crouch in [1] and [2]) based on a clustering of the document collection. The authors validate the approach by showing that the use of thesauri generated by this method results in substantial improvements in retrieval effectiveness in four test collections. The term discrimination value theory, used in the thesaurus generation algorithm to determine a term's membership in a particular thesaurus class, is found not to be useful in distinguishing between thesaurus classes (i.e., in differentiating a "good" from an "indifferent" or "poor" thesaurus class). In conclusion, the authors suggest an alternate approach to automatic thesaurus construction which greatly simplifies the work of producing viable thesaurus classes. Experimental results show that the alternate approach described herein in some cases produces thesauri which are comparable in retrieval effectiveness to those produced by the first method at much lower cost.
The discrimination value of a term is defined as a
measure of the change in space separation which occurs
when a given term is assigned to the document collection.
A good discriminator is a term which, when assigned to a
document, decreases the space density (rendering the
documents less similar to each other). A poor
discriminator, then, increases space density. By computing
the density of the document space before and after the
assignment of each term, the discrimination value of the
term can be determined.
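
A minimal sketch of the before/after density comparison described above. It assumes that space density is the average pairwise cosine similarity between document vectors; the notes do not give the exact density formula, so this choice, the function names, and the toy vectors are illustrative only.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length term-weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def space_density(docs):
    """Assumed density measure: mean cosine similarity over all document pairs."""
    pairs = list(combinations(docs, 2))
    if not pairs:
        raise ValueError("need at least two documents")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def discrimination_value(docs, term):
    """Density without the term minus density with it.

    Positive: assigning the term spreads the documents apart (good discriminator).
    Negative: assigning the term packs them together (poor discriminator).
    """
    without = [[w for i, w in enumerate(d) if i != term] for d in docs]
    return space_density(without) - space_density(docs)


# Hypothetical collection: rows are documents, columns are term weights.
# Term 0 occurs equally in every document; term 1 occurs in only one.
docs = [
    [2, 0, 1],
    [2, 3, 0],
    [2, 0, 4],
]

for term in range(3):
    print(term, round(discrimination_value(docs, term), 3))
```

In this toy collection the term shared by every document gets a negative value and the term confined to one document gets a positive value, which matches the note that document frequency and discrimination value are related.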


Empirical results have shown that document frequency and
discrimination value are well correlated.
updated at: 2007/06/12 15:36:38
James R. Curran and Marc Moens.
Scaling Context Space.
In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL),
pp. 231-238,
2002.
Abstract: Context is used in many NLP systems as an indicator of a term's syntactic and semantic function. The accuracy of the system is dependent on the quality and quantity of contextual information available to describe each term. However, the quantity variable is no longer fixed by limited corpus resources. Given fixed training time and computational resources, it makes sense for systems to invest time in extracting high quality contextual information from a fixed corpus. However, with an effectively limitless quantity of text available, extraction rate and representation size need to be considered. We use thesaurus extraction with a range of context extracting tools to demonstrate the interaction between context quantity, time and size on a corpus of 300 million words.
corpus size is no longer a limiting factor
W(L1R1), W(L12) give reasonable results
log-linear relation between corpus size and performance
"It is a phenomenon common to many NLP tasks that the quality or accuracy of a system increases loglinearly with the size of the corpus."
"it could well be that far simpler but scalable learning algorithms significantly outperform existing systems."

used a 300M-word corpus! (cf. WordBank = 3.5M)
up to now people have typically worked with corpora of around one million words (up to one billion!)
thesaurus extraction is a task where success has been limited when using small corpora
updated at: 2007/05/13 09:53:53
Chris Ding and Hanchuan Peng.
Minimum Redundancy Feature Selection from Microarray Gene Expression Data,
Proceedings of the IEEE Computer Society Conference on Bioinformatics,
pp. 523-528, 2003.
Motivation. How to select a small subset out of the thousands of genes in microarray data is important for accurate classification of phenotypes. Widely used methods typically rank genes according to their differential expressions among phenotypes and pick the top-ranked genes. We observe that feature sets so obtained have certain redundancy and study methods to minimize it. Results. We propose a minimum redundancy – maximum relevance (MRMR) feature selection framework. Genes selected via MRMR provide a more balanced coverage of the space and capture broader characteristics of phenotypes. They lead to significantly improved class predictions in extensive experiments on 5 gene expression data sets: NCI, Lymphoma, Lung, Leukemia and Colon. Improvements are observed consistently among 4 classification methods: Naïve Bayes, Linear discriminant analysis, Logistic regression and Support vector machines. Supplementary: The top 60 MRMR genes for each of the datasets are listed in http://www.nersc.gov/~cding/MRMR/
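A rough sketch of the idea in the abstract, not the authors' implementation. It assumes discretized features, mutual information as both the relevance and redundancy measure, and a greedy "relevance minus mean redundancy" score; the feature names g1 to g3 and the data are made up.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete sequences."""
    n = len(xs)
    px, py = Counter(xs), Counter(ys)
    pxy = Counter(zip(xs, ys))
    return sum(
        (c / n) * math.log((c * n) / (px[x] * py[y]))
        for (x, y), c in pxy.items()
    )


def mrmr_select(features, labels, k):
    """Greedily pick k feature names, trading relevance against redundancy.

    features: dict mapping a feature name to its list of discrete values.
    """
    relevance = {
        name: mutual_information(values, labels)
        for name, values in features.items()
    }
    selected = []
    remaining = sorted(features)  # sorted for deterministic tie-breaking

    def score(name):
        if not selected:
            return relevance[name]
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return selected


# Hypothetical discretized expression levels for three genes, eight samples.
labels = [0, 0, 0, 0, 1, 1, 1, 1]
features = {
    "g1": [0, 0, 0, 0, 1, 1, 1, 0],  # most relevant
    "g2": [0, 0, 0, 0, 1, 1, 0, 0],  # relevant, but largely repeats g1
    "g3": [0, 1, 0, 0, 1, 1, 0, 1],  # less relevant, but complementary
}

by_relevance = sorted(
    features, key=lambda n: -mutual_information(features[n], labels)
)[:2]
print("top-ranked only:", by_relevance)
print("mrmr:", mrmr_select(features, labels, 2))
```

On this toy data plain ranking keeps the two near-duplicate features, while the greedy score swaps the second one for the complementary feature.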
updated at: 2007/03/06 13:30:45
¥µü«¥È¥»¥·, ¥ª¥Í¥Ê¥È¥Õ¥å, ¥ÆìÂüÏ¥ª¥µ¥è
A Tool for Constructing a Synonym Dictionary Using Context Information
IPSJ SIG Technical Report, NL176, pp. 87-94,
2006.
To improve the proficiency of text processing such as information retrieval or text mining, it is necessary to construct a synonym dictionary, but it is very tiresome to make it by hand. In some fields, such as aviation, synonym nouns are mixed with kanji/hiragana, katakana, alphabet and their abbreviations. As new words always come to be used, the dictionary update is a big issue. In this paper, we propose a tool for constructing a synonym dictionary. The system will return synonym candidates against a query. A synonym can be easily registered in the dictionary by looking at the synonym candidates. We tested the system on aviation pilot reports and evaluated it by average precision.
"frequency is sometimes adjusted as log(x_i + 1)" -> effective
window[2,2] was the best
spiral construction -> not better
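A small sketch of the two settings noted above: a context window of two tokens on each side, and frequencies adjusted as log(x_i + 1). The function names and the English token list are hypothetical; the paper's own tokenization of Japanese pilot reports is not reproduced here.

```python
import math
from collections import Counter, defaultdict


def window_contexts(tokens, left=2, right=2):
    """Count the words seen within [left, right] tokens of each headword."""
    contexts = defaultdict(Counter)
    for i, head in enumerate(tokens):
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i:
                contexts[head][tokens[j]] += 1
    return contexts


def log_scale(counts):
    """Dampen raw co-occurrence counts x_i to log(x_i + 1)."""
    return {word: math.log(x + 1) for word, x in counts.items()}


# Hypothetical sentence standing in for a pilot report.
tokens = "the pilot reported the engine failure to the tower".split()

raw = window_contexts(tokens)
vectors = {head: log_scale(counts) for head, counts in raw.items()}

print(dict(raw["pilot"]))
print(vectors["pilot"])
```

The log adjustment keeps very frequent context words from dominating a headword's vector while leaving the ordering of counts unchanged.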
updated at: 2007/01/23 10:01:54
James Curran.
From Distributional to Semantic Similarity.
PhD thesis, University of Edinburgh,
2004.

<<BOOKMARK>> read only chapter 3.

Landauer and Dumais (1997) -> argue that a 500 "character" limit is more appropriate.
"a fixed character window will select either fewer longer (and thus more informative) words or more shorter (and thus less informative) words, extracting a consistent amount of contextual information for each headword"
updated at: 2007/01/20 17:36:44
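
One possible reading of the fixed character window quoted in the notes on the thesis, sketched under assumptions: context words are taken nearest-first on alternating sides of the headword until a character budget is spent. The function name, the alternation order, and counting only word characters are all illustrative choices, not taken from the thesis.

```python
def char_window(tokens, index, budget=500):
    """Collect words around tokens[index], nearest first, within a character budget."""
    context = []
    used = 0
    left, right = index - 1, index + 1
    while left >= 0 or right < len(tokens):
        for pos in (left, right):
            if 0 <= pos < len(tokens):
                word = tokens[pos]
                if used + len(word) > budget:
                    return context  # budget exhausted
                context.append(word)
                used += len(word)
        left -= 1
        right += 1
    return context


# Same headword position, same budget, different word lengths.
short_words = "a bb HEAD cc d".split()
long_words = "alpha bravo HEAD charlie delta".split()

print(char_window(short_words, 2, budget=6))
print(char_window(long_words, 2, budget=6))
```

With the same budget, the window over short words collects more context words than the window over long words, which is the trade-off the quotation describes.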