Tag search results for 'mutual'
Young Mee Chung and Jae Yun Lee.
A corpus-based approach to comparative evaluation of statistical term association measures.
Journal of the American Society for Information Science and Technology.
volume 52, issue 4, pages 283--296,
2001.
Statistical association measures have been widely applied in information retrieval research, usually employing a clustering of documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationship and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson's correlation coefficient and MDS mapping, and an analysis of the impact of term frequency on the association values by means of z-score. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule's coefficient of colligation Y, whereas cosine and Jaccard coefficients, as well as the χ² statistic and likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the χ² statistic is the least affected by the frequency of terms. Third, although cosine and Jaccard coefficients tend to emphasize high frequency terms, mutual information and Yule's Y seem to overestimate rare terms.
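Note: the six measures named in the abstract can all be computed from a 2x2 co-occurrence table for a term pair. The sketch below uses common textbook forms of each measure; the paper's exact definitions may differ in detail. The function name and the cell names a, b, c, d (both terms, x only, y only, neither) are assumptions made for illustration.

```python
import math

def association_measures(a, b, c, d):
    """Association scores for a term pair (x, y) from a 2x2 table.

    a: units containing both x and y    b: x but not y
    c: y but not x                      d: neither
    Assumes a > 0 and no empty row or column (hypothetical helper,
    not the paper's code).
    """
    n = a + b + c + d
    fx, fy = a + b, a + c            # marginal frequencies of x and y
    not_x, not_y = c + d, b + d

    # Expected cell counts under independence, in the order a, b, c, d.
    observed = [a, b, c, d]
    expected = [fx * fy / n, fx * not_y / n, not_x * fy / n, not_x * not_y / n]
    log_likelihood = 2.0 * sum(
        o * math.log(o / e) for o, e in zip(observed, expected) if o > 0
    )

    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return {
        "mutual_information": math.log2(a * n / (fx * fy)),
        "yule_y": (root_ad - root_bc) / (root_ad + root_bc),
        "cosine": a / math.sqrt(fx * fy),
        "jaccard": a / (a + b + c),
        "chi_square": n * (a * d - b * c) ** 2 / (fx * not_x * fy * not_y),
        "log_likelihood": log_likelihood,
    }

# Illustrative counts only, not data from the paper.
scores = association_measures(10, 10, 10, 70)
```

Comparing the scores for a frequent pair and a rare pair with the same degree of overlap is a quick way to see the frequency effects the abstract describes.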
updated at: 2007/06/12 22:02:28
Kenneth Ward Church, Patrick Hanks
Word Association Norms, Mutual Information, and Lexicography
Computational Linguistics 16(1): 22-29.
1990.
The term word association is used in a very particular sense in the psycholinguistic literature. (Generally speaking, subjects respond quicker than normal to the word "nurse" if it follows a highly associated word such as "doctor.") We will extend the term to provide the basis for a statistical description of a variety of interesting linguistic phenomena, ranging from semantic relations of the doctor/nurse type (content word/content word) to lexico-syntactic co-occurrence constraints between
Smaller window sizes will identify fixed expressions (idioms) and other relations that hold over short ranges; larger window sizes will highlight semantic concepts and other relationships that hold over larger scales.
(asymmetry) f(x, y) \neq f(y, x) because f(x, y) encodes linear precedence.
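Note: a minimal sketch of a windowed, order-sensitive association score in the spirit of these two remarks. It counts f(x, y) as the number of times y occurs within `window` words after x and scores the pair as log2 of P(x, y) / (P(x) P(y)), with probabilities estimated by dividing counts by the corpus length. The function name, the exact window convention, and the normalization are assumptions made for illustration and may not match the paper in every detail.

```python
import math
from collections import Counter

def association_ratios(tokens, window=5):
    """Score ordered pairs (x, y) where y follows x within `window` words.

    Hypothetical helper: `window` is the number of following positions
    examined after each x, which is an assumption of this sketch.
    """
    n = len(tokens)
    unigram = Counter(tokens)
    pair = Counter()
    for i, x in enumerate(tokens):
        # Only look to the right of x, so f(x, y) encodes linear precedence.
        for y in tokens[i + 1 : i + 1 + window]:
            pair[(x, y)] += 1
    return {
        (x, y): math.log2((fxy / n) / ((unigram[x] / n) * (unigram[y] / n)))
        for (x, y), fxy in pair.items()
    }

# Toy corpus for illustration only.
toy = "doctor nurse doctor nurse hospital".split()
scores = association_ratios(toy, window=1)
```

With `window=1` on the toy corpus, the score for ("doctor", "nurse") differs from the score for ("nurse", "doctor"), which is the asymmetry noted above. Raising `window` admits looser, longer-range pairs.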
updated at: 2007/06/11 12:30:23
Chris Ding and Hanchuan Peng.
Minimum Redundancy Feature Selection from Microarray Gene Expression Data,
Proceedings of the IEEE Computer Society Conference on Bioinformatics,
pp. 523-528, 2003.
Motivation. Selecting a small subset out of the thousands of genes in microarray data is important for accurate classification of phenotypes. Widely used methods typically rank genes according to their differential expressions among phenotypes and pick the top-ranked genes. We observe that feature sets so obtained have certain redundancy and study methods to minimize it. Results. We propose a minimum redundancy – maximum relevance (MRMR) feature selection framework. Genes selected via MRMR provide a more balanced coverage of the space and capture broader characteristics of phenotypes. They lead to significantly improved class predictions in extensive experiments on 5 gene expression data sets: NCI, Lymphoma, Lung, Leukemia and Colon. Improvements are observed consistently among 4 classification methods: Naïve Bayes, Linear discriminant analysis, Logistic regression and Support vector machines. Supplementary: The top 60 MRMR genes for each of the datasets are listed in http://www.nersc.gov/~cding/MRMR/
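Note: a minimal sketch of greedy minimum redundancy, maximum relevance selection for discrete features. It scores each candidate as its mutual information with the class labels minus its mean mutual information with the features already selected (a difference-style criterion). This is an illustration under stated assumptions, not the authors' implementation; the function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (count / n) * math.log2(count * n / (px[x] * py[y]))
        for (x, y), count in pxy.items()
    )

def mrmr_select(features, labels, k):
    """Greedily pick k feature names from `features` (name -> value list)."""
    relevance = {
        name: mutual_information(values, labels)
        for name, values in features.items()
    }
    # Start from the single most relevant feature.
    selected = [max(relevance, key=relevance.get)]

    def score(name):
        # Relevance minus mean redundancy with the already selected features.
        redundancy = sum(
            mutual_information(features[name], features[s]) for s in selected
        ) / len(selected)
        return relevance[name] - redundancy

    while len(selected) < min(k, len(features)):
        candidates = [name for name in features if name not in selected]
        selected.append(max(candidates, key=score))
    return selected

# Toy data for illustration only: gene_a_copy duplicates gene_a.
features = {
    "gene_a": [0, 0, 1, 1],
    "gene_a_copy": [0, 0, 1, 1],
    "gene_b": [0, 1, 0, 1],
}
labels = [0, 1, 2, 3]
chosen = mrmr_select(features, labels, 2)
```

In the toy data all three features are equally relevant on their own, so a plain relevance ranking cannot tell the duplicate from the complementary feature. The redundancy term is what steers the second pick toward gene_b.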
updated at: 2007/03/06 13:30:45
Donald Hindle.
Noun classification from predicate-argument structures.
In 28th Annual Meeting of the Association for Computational Linguistics,
pp. 268-275,
1990.
Abstract: A method of determining the similarity of nouns on the basis of a metric derived from the distribution of subject, verb and object in a large text corpus is described. The resulting quasi-semantic classification of nouns demonstrates the plausibility of the distributional hypothesis, and has potential application to a variety of tasks, including automatic indexing, resolving nominal compounds, and determining the scope of modification.
"the meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities." (Harris 1968:12).
"More is to be learned from the fact that you can drink wine than from the fact that you can drink it even though there are more clauses in our sample with it as an object of drink than with wine."
"We can define "reciprocally most similar" nouns or "reciprocal nearest neighbors" (RNN) as two nouns which are each other's most similar noun."
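Note: the reciprocal nearest neighbor definition quoted above can be sketched as follows, given any pairwise noun similarity. The function name, the dictionary input format, and the scores are assumptions made for illustration; they are not taken from the paper.

```python
def reciprocal_nearest_neighbors(similarity):
    """Return sorted noun pairs that are each other's most similar noun.

    `similarity` maps (noun_a, noun_b) to a score and is treated as
    symmetric (hypothetical input format).
    """
    neighbors = {}
    for (a, b), score in similarity.items():
        neighbors.setdefault(a, {})[b] = score
        neighbors.setdefault(b, {})[a] = score

    # Most similar noun for every noun.
    nearest = {
        noun: max(scores, key=scores.get) for noun, scores in neighbors.items()
    }
    # Keep a pair only when the relation holds in both directions.
    return sorted(
        {tuple(sorted((noun, best))) for noun, best in nearest.items()
         if nearest[best] == noun}
    )

# Made-up similarity scores for illustration only.
toy_similarity = {
    ("wine", "beer"): 0.9,
    ("wine", "water"): 0.4,
    ("beer", "water"): 0.5,
    ("water", "juice"): 0.6,
    ("juice", "wine"): 0.3,
    ("tea", "wine"): 0.7,
}
pairs = reciprocal_nearest_neighbors(toy_similarity)
```

Here "tea" has "wine" as its nearest neighbor, but the relation is not returned in the other direction, so "tea" belongs to no reciprocal pair.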
updated at: 2007/01/22 16:19:20
Dekang Lin.
Automatic retrieval and clustering of similar words.
In Proceedings of the 17th International Conference on Computational Linguistics and of the 36th Annual Meeting of the Association for Computational Linguistics,
pp. 768-774,
1998.
Automatic retrieval and clustering of similar words.
In Proceedings of the 17th International Conference on Computational Linguistics and of the 36th Annual Meeting of the Association for Computational Linguistics,
pp. 768-774,
1998.
Abstract: Bootstrapping semantics from text is one of the greatest challenges in natural language learning. We first define a word similarity measure based on the distributional pattern of words. The similarity measure allows us to construct a thesaurus using a parsed corpus. We then present a new evaluation methodology for the automatically constructed thesaurus. The evaluation results show that the thesaurus is significantly closer to WordNet than Roget Thesaurus is.
"It was shown in (Dagan et al., 1997) that a similarity-based smoothing
method achieved much better results than backoff smoothing methods in
word sense disambiguation."
"The differences between Hindle and Hindle_r clearly demonstrate that
the use of other types of dependencies in addition to subject and
object relationships is very beneficial."
updated at: 2007/01/20 17:33:48