Tag search results for 'frequency'
James R. Curran and Marc Moens.
Improvements in Automatic Thesaurus Extraction.
In Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX),
pp. 59-66,
2002.
Abstract: The use of semantic resources is common in modern NLP systems, but methods to extract lexical semantics have only recently begun to perform well enough for practical use. We evaluate existing and new similarity metrics for thesaurus extraction, and experiment with the tradeoff between extraction performance and efficiency. We propose an approximation algorithm, based on canonical attributes and coarse- and fine-grained matching, that reduces the time complexity and execution time of thesaurus extraction with only a marginal performance penalty.
thesaurus extraction systems -> differ in the definition of "context"
used a statistical shallow parser
frequency cutoff speeds up the calculation, but doesn't decrease the performance
misc. topics: weights, measures, cutoff frequency, speed-up by canonical vectors
canonical vectors: subj+dobj+iobj, TTestLog + maximum frequency cutoff
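A minimal sketch of how a frequency cutoff shrinks the context vectors that have to be compared. The cutoff rule (dropping attributes with low counts), the Jaccard measure, the function names, and the toy counts are all assumptions for illustration; they are not taken from the paper.

```python
from collections import Counter

def apply_min_cutoff(vectors, min_count):
    """Drop context attributes whose count is below min_count (hypothetical rule)."""
    return {
        head: Counter({attr: n for attr, n in ctx.items() if n >= min_count})
        for head, ctx in vectors.items()
    }

def jaccard(a, b):
    """Set-based Jaccard overlap between the attribute sets of two headwords."""
    shared = set(a) & set(b)
    union = set(a) | set(b)
    return len(shared) / len(union) if union else 0.0

# Toy (relation, word) -> count vectors; the numbers are invented.
vectors = {
    "car": Counter({("dobj", "drive"): 12, ("dobj", "park"): 5, ("subj", "crash"): 1}),
    "truck": Counter({("dobj", "drive"): 9, ("dobj", "load"): 4, ("subj", "sing"): 1}),
}

pruned = apply_min_cutoff(vectors, min_count=2)
print(len(vectors["car"]), "->", len(pruned["car"]), "attributes for 'car'")
print(jaccard(vectors["car"], vectors["truck"]), jaccard(pruned["car"], pruned["truck"]))
```

Fewer attributes per headword means fewer comparisons per pair, which is where the speed-up would come from under these assumptions.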
updated at: 2007/07/07 17:25:42
Young Mee Chung and Jae Yun Lee.
A corpus-based approach to comparative evaluation of statistical term association measures.
Journal of the American Society for Information Science and Technology.
volume 52, issue 4, pages 283--296,
2001.
Statistical association measures have been widely applied in information retrieval research, usually employing a clustering of documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationship and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of a term frequency on the association values by means of z-score. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas cosine and Jaccard coefficients, as well as χ² statistic and likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the χ² statistic is the least affected by the frequency of terms. Third, although cosine and Jaccard coefficients tend to emphasize high frequency terms, mutual information and Yule's Y seem to overestimate rare terms.
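The six measures named in the abstract can all be computed from a 2x2 co-occurrence table. The sketch below uses common textbook definitions, which are an assumption and may differ in detail from the ones the paper evaluates; the cell names a, b, c, d and the function name are also hypothetical.

```python
import math

def association_measures(a, b, c, d):
    """Association scores for two terms from a 2x2 contingency table.

    a: documents containing both terms, b: first term only,
    c: second term only, d: neither. Assumes every marginal total is positive.
    """
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d

    # Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)); empty cells contribute 0.
    observed = [a, b, c, d]
    expected = [row1 * col1 / n, row1 * col2 / n, row2 * col1 / n, row2 * col2 / n]
    g2 = 2.0 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

    sqrt_ad, sqrt_bc = math.sqrt(a * d), math.sqrt(b * c)
    return {
        "mutual_information": math.log2(a * n / (row1 * col1)) if a > 0 else float("-inf"),
        "yule_y": (sqrt_ad - sqrt_bc) / (sqrt_ad + sqrt_bc),
        "cosine": a / math.sqrt(row1 * col1),
        "jaccard": a / (a + b + c),
        "chi_square": n * (a * d - b * c) ** 2 / (row1 * row2 * col1 * col2),
        "likelihood_ratio": g2,
    }

# Invented counts: 100 documents, the two terms co-occur in 10 of them.
scores = association_measures(a=10, b=10, c=10, d=70)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

Varying a, b, and c while holding their ratios fixed is one simple way to see how strongly each score depends on raw term frequency.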
updated at: 2007/06/12 22:02:28
Carolyn J. Crouch and Bokyung Yang
Experiments in Automatic Statistical Thesaurus Construction
Proceedings of the 15th annual international ACM SIGIR conference on Research and development in information retrieval.
pp. 77--88
1992
A well constructed thesaurus has long been recognized as a valuable tool in the effective operation of an information retrieval system. This paper reports the results of experiments designed to determine the validity of an approach to the automatic construction of global thesauri (described originally by Crouch in [1] and [2]) based on a clustering of the document collection. The authors validate the approach by showing that the use of thesauri generated by this method results in substantial improvements in retrieval effectiveness in four test collections. The term discrimination value theory, used in the thesaurus generation algorithm to determine a term's membership in a particular thesaurus class, is found not to be useful in distinguishing between thesaurus classes (i.e., in differentiating a "good" from an "indifferent" or "poor" thesaurus class). In conclusion, the authors suggest an alternate approach to automatic thesaurus construction which greatly simplifies the work of producing viable thesaurus classes. Experimental results show that the alternate approach described herein in some cases produces thesauri which are comparable in retrieval effectiveness to those produced by the first method at much lower cost.
The discrimination value of a term is defined as a
measure of the change in space separation which occurs
when a given term is assigned to the document collection.
A good discriminator is a term which, when assigned to a
document, decreases the space density (rendering the
documents less similar to each other). A poor
discriminator, then, increases space density. By computing
the density of the document space before and after the
assignment of each term, the discrimination value of the
term can be determined.
Empirical results have shown that document frequency and
discrimination value are well correlated.
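The definition above can be sketched as follows. Space density is taken here to be the average pairwise cosine similarity between documents, and the discrimination value is the density without the term minus the density with it; both choices, the function names, and the toy documents are assumptions, since the notes do not fix the exact formula.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two sparse term-weight dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def space_density(docs):
    """Average pairwise similarity; higher means documents are more alike."""
    pairs = list(combinations(docs, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)

def discrimination_value(docs, term):
    """Density before assigning the term minus density after assigning it.

    Positive: the term spreads documents apart (good discriminator).
    Negative: the term pulls documents together (poor discriminator).
    """
    without_term = [{t: w for t, w in d.items() if t != term} for d in docs]
    return space_density(without_term) - space_density(docs)

# Invented three-document collection with binary weights.
docs = [
    {"the": 1.0, "parser": 1.0},
    {"the": 1.0, "cluster": 1.0},
    {"the": 1.0, "thesaurus": 1.0},
]
print(discrimination_value(docs, "the"))     # term in every document
print(discrimination_value(docs, "parser"))  # term in a single document
```

In this toy collection the term that occurs in every document gets a negative value and the term that occurs in one document gets a positive value, in line with the note that document frequency and discrimination value are correlated.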
updated at: 2007/06/12 15:36:38
»ûÅľ¼, µÈÅÄÌ, ÃæÀî͵»Ö
ʸ̮¾ðÊó¤Ë¤è¤ëƱµÁ¸ì¼½ñºîÀ®»Ù±ç¥Ä¡¼¥ë
IPSJ SIG Technical Report, NL176, pp. 87-94,
2006.
To improve the proficiency of text processing such as information retrieval or text mining, it is necessary to construct a synonym dictionary, but it is very tiresome to make it by hand. In some fields, such as aviation, synonym nouns are mixed with kanji/hiragana, katakana, alphabet and their abbreviations. As new words always come to be used, the dictionary update is a big issue. In this paper, we propose a tool for constructing a synonym dictionary. The system will return synonym candidates against a query. A synonym can be easily registered in the dictionary by looking at the synonym candidates. We tested the system performance on aviation pilot reports and evaluated it by average precision.
"frequency is sometimes adjusted as log(x_i + 1)" -> effective
window[2,2] was the best
spiral construction -> not better
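A minimal sketch of the two settings noted above: a context window of two tokens on each side, and counts damped as log(x_i + 1). The tokenized input, the function names, and the natural-log base are assumptions; the report's own preprocessing is not described in these notes.

```python
import math
from collections import Counter, defaultdict

def window_contexts(tokens, left=2, right=2):
    """Count co-occurring tokens within [left, right] positions of each token."""
    contexts = defaultdict(Counter)
    for i, token in enumerate(tokens):
        start = max(0, i - left)
        stop = min(len(tokens), i + right + 1)
        for j in range(start, stop):
            if j != i:
                contexts[token][tokens[j]] += 1
    return contexts

def log_scale(counts):
    """Damp raw counts x_i to log(x_i + 1)."""
    return {word: math.log(count + 1) for word, count in counts.items()}

# Invented token sequence.
tokens = ["pilot", "reported", "engine", "pilot", "checked"]
contexts = window_contexts(tokens, left=2, right=2)
print(dict(contexts["pilot"]))
print(log_scale(contexts["pilot"]))
```

The damped vectors can then be passed to whatever similarity measure ranks the synonym candidates.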
updated at: 2007/01/23 10:01:54
Fernando Pereira, Naftali Tishby, and Lillian Lee.
Distributional clustering of English words.
In Proceedings of the 31st annual meeting of the Association for Computational Linguistics,
pp. 183-190,
1993.
Abstract: We describe and evaluate experimentally a method for clustering words according to their distribution in particular syntactic contexts. Words are represented by the relative frequency distributions of contexts in which they appear, and relative entropy between those distributions is used as the similarity measure for clustering. Clusters are represented by average context distributions derived from the given words according to their probabilities of cluster membership. In many cases, the clusters can be thought of as encoding coarse sense distinctions. Deterministic annealing is used to find lowest distortion sets of clusters: as the annealing parameter increases, existing clusters become unstable and subdivide, yielding a hierarchical "soft" clustering of the data. Clusters are used as the basis for class models of word co-occurrence, and the models evaluated with respect to held-out test data.
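A minimal sketch of the representation described in the abstract: each noun becomes a relative frequency distribution over the verbs that take it as direct object, a cluster is a membership-weighted average of such distributions, and relative entropy (KL divergence) measures how far a word is from a cluster. The function names, the toy counts, and the fixed membership weights are assumptions; the deterministic annealing step is not shown.

```python
import math

def relative_frequencies(counts):
    """Turn raw context counts into a probability distribution."""
    total = sum(counts.values())
    return {context: n / total for context, n in counts.items()}

def kl_divergence(p, q):
    """Relative entropy D(p || q); q must be positive wherever p is."""
    return sum(pi * math.log(pi / q[context]) for context, pi in p.items() if pi > 0)

def centroid(distributions, weights):
    """Average context distribution, weighted by cluster membership."""
    total = sum(weights)
    contexts = set().union(*distributions)
    return {
        context: sum(w * d.get(context, 0.0) for d, w in zip(distributions, weights)) / total
        for context in contexts
    }

# Invented verb counts for two direct-object nouns.
wine = relative_frequencies({"drink": 3, "pour": 1})
beer = relative_frequencies({"drink": 1, "pour": 1})

cluster = centroid([wine, beer], weights=[1.0, 1.0])
print(kl_divergence(wine, cluster), kl_divergence(beer, cluster))
```

Because the centroid averages its members, it is nonzero wherever any member is, so the divergence from a member to its own cluster stays finite.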
"the relation between a transitive main verb and the head noun of its direct object."
parsed by Hindle's parser Fidditch
<<<BOOKMARK>>> read till "Distributional Similarity" on page 2.
updated at: 2007/01/20 17:10:37