Tag search results for 'similarity'
James R. Curran and Marc Moens.
Improvements in Automatic Thesaurus Extraction.
In Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX),
pp. 59-66,
2002.
Abstract: The use of semantic resources is common in modern NLP systems, but methods to extract lexical semantics have only recently begun to perform well enough for practical use. We evaluate existing and new similarity metrics for thesaurus extraction, and experiment with the tradeoff between extraction performance and efficiency. We propose an approximation algorithm, based on canonical attributes and coarse- and fine-grained matching, that reduces the time complexity and execution time of thesaurus extraction with only a marginal performance penalty.
thesaurus extraction systems -> differ in the definition of "context"
used a statistical shallow parser
frequency cutoff speeds up the calculation, but doesn't decrease the performance
misc. topics: weights, measures, cutoff frequency, speed-up by canonical vectors
canonical vectors: subj+dobj+iobj, TTestLog + maximum frequency cutoff
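The notes above mention a frequency cutoff and a speed-up via canonical vectors. The following is only a minimal sketch of that general idea, not the paper's algorithm: the toy data, the function names, the use of raw counts in place of TTestLog weights, and the Jaccard stand-in for the similarity measure are all assumptions made for illustration.

```python
from collections import Counter

# Hypothetical toy data: headword -> counts of (relation, word) attributes.
VECTORS = {
    "wine": Counter({("dobj", "drink"): 5, ("dobj", "pour"): 2, ("adj", "red"): 3}),
    "beer": Counter({("dobj", "drink"): 4, ("dobj", "brew"): 2, ("adj", "cold"): 1}),
    "car": Counter({("dobj", "drive"): 6, ("subj", "crash"): 1, ("adj", "red"): 2}),
    "ale": Counter({("dobj", "drink"): 1}),
}


def apply_cutoff(vectors, min_total):
    """Drop headwords whose total attribute frequency is below min_total."""
    return {w: v for w, v in vectors.items() if sum(v.values()) >= min_total}


def canonical(vector, relations=("subj", "dobj", "iobj"), k=2):
    """Top-k attributes by raw count among the chosen relation types.

    Assumption: raw counts stand in for the TTestLog weights in the notes.
    """
    kept = Counter({a: c for a, c in vector.items() if a[0] in relations})
    return {a for a, _ in kept.most_common(k)}


def jaccard(a, b):
    """Set Jaccard over attribute keys (a stand-in similarity measure)."""
    keys_a, keys_b = set(a), set(b)
    union = keys_a | keys_b
    return len(keys_a & keys_b) / len(union) if union else 0.0


def neighbours(target, vectors, k=2):
    """Coarse pass on canonical attributes, then fine pass on full vectors."""
    canon = {w: canonical(v, k=k) for w, v in vectors.items()}
    # Coarse pass: keep only words sharing a canonical attribute with target.
    candidates = [w for w in vectors if w != target and canon[w] & canon[target]]
    # Fine pass: full comparison, but only for the surviving candidates.
    scored = [(jaccard(vectors[target], vectors[w]), w) for w in candidates]
    return sorted(scored, reverse=True)


kept = apply_cutoff(VECTORS, min_total=3)
result = neighbours("wine", kept)
```

In this toy run "car" shares the attribute ("adj", "red") with "wine" but is never compared in full, which illustrates how a coarse pass can trade a little recall for fewer comparisons.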
updated at: 2007/07/07 17:25:42
Young Mee Chung and Jae Yun Lee.
A corpus-based approach to comparative evaluation of statistical term association measures.
Journal of the American Society for Information Science and Technology.
volume 52, issue 4, pages 283--296,
2001.
Statistical association measures have been widely applied in information retrieval research, usually employing a clustering of documents or terms on the basis of their relationships. Applications of the association measures for term clustering include automatic thesaurus construction and query expansion. This research evaluates the similarity of six association measures by comparing the relationship and behavior they demonstrate in various analyses of a test corpus. Analysis techniques include comparisons of highly ranked term pairs and term clusters, analyses of the correlation among the association measures using Pearson’s correlation coefficient and MDS mapping, and an analysis of the impact of a term frequency on the association values by means of z-score. The major findings of the study are as follows: First, the most similar association measures are mutual information and Yule’s coefficient of colligation Y, whereas cosine and Jaccard coefficients, as well as χ² statistic and likelihood ratio, demonstrate quite similar behavior for terms with high frequency. Second, among all the measures, the χ² statistic is the least affected by the frequency of terms. Third, although cosine and Jaccard coefficients tend to emphasize high frequency terms, mutual information and Yule’s Y seem to overestimate rare terms.
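For reference, the six measures named in the abstract can be sketched from a 2x2 document contingency table. These are the common textbook forms; the exact variants used in the paper may differ, and the cell names a, b, c, d and the example counts are assumptions made here.

```python
import math


def association_measures(a, b, c, d):
    """Textbook association measures for two terms x and y.

    a: documents containing both x and y
    b: documents containing x only
    c: documents containing y only
    d: documents containing neither
    Assumes every row and column total is non-zero.
    """
    n = a + b + c + d
    row_x, row_not_x = a + b, c + d
    col_y, col_not_y = a + c, b + d

    # Log-likelihood ratio G^2 = 2 * sum(O * ln(O / E)); empty cells add 0.
    observed = [a, b, c, d]
    expected = [
        row_x * col_y / n,
        row_x * col_not_y / n,
        row_not_x * col_y / n,
        row_not_x * col_not_y / n,
    ]
    llr = 2 * sum(o * math.log(o / e) for o, e in zip(observed, expected) if o > 0)

    root_ad, root_bc = math.sqrt(a * d), math.sqrt(b * c)
    return {
        "mutual_information": math.log2(a * n / (row_x * col_y)) if a else -math.inf,
        "yule_y": (root_ad - root_bc) / (root_ad + root_bc),
        "cosine": a / math.sqrt(row_x * col_y),
        "jaccard": a / (a + b + c),
        "chi_square": n * (a * d - b * c) ** 2 / (row_x * row_not_x * col_y * col_not_y),
        "likelihood_ratio": llr,
    }


# Hypothetical counts: an associated pair and a statistically independent pair.
associated = association_measures(20, 30, 10, 940)
independent = association_measures(1, 9, 9, 81)
```

For the independent table, mutual information, Yule's Y, χ² and the likelihood ratio all come out as zero, while cosine and Jaccard stay positive because they depend only on the co-occurrence counts.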
updated at: 2007/06/12 22:02:28
Pablo Gamallo, Caroline Gasperin, Alexandre Agustini, and Gabriel P. Lopes
Syntactic-Based Methods for Measuring Word Similarity
MAUTNER V., MOUCEK R., MOUCEK K., Eds., Text, Speech, and Discourse (TSD-2001),
p. 116--125,
Berlin:Springer Verlag, 2001.
Abstract. This paper explores different strategies for extracting similarity relations between words from partially parsed text corpora. The strategies we have analysed require neither supervised training nor semantic information available from general lexical resources. They differ in the amount and the quality of the syntactic contexts against which words are compared. The paper presents in detail the notion of syntactic context and how syntactic information could be used to extract semantic regularities of word sequences. Finally, experimental tests with a Portuguese corpus demonstrate that similarity measures based on fine-grained and elaborate syntactic contexts perform better than those based on poorly defined contexts.
updated at: 2007/06/12 11:04:03
Vasileios Hatzivassiloglou and Kathleen McKeown
Toward the Automatic Identification of Adjectival Scales: Clustering Adjectives According to Meaning
In Proceedings of the 31st Annual Meeting of the ACL, pages 172--182, Columbus, Ohio.
1993.
In this paper we present a method to group adjectives according to their meaning, as a first step towards the automatic identification of adjectival scales. We discuss the properties of adjectival scales and of groups of semantically related adjectives and how they imply sources of linguistic knowledge in text corpora. We describe how our system exploits this linguistic knowledge to compute a measure of similarity between two adjectives, using statistical techniques and without having access to any semantic information about the adjectives. We also show how a clustering algorithm can use these similarities to produce the groups of adjectives, and we present results produced by our system for a sample set of adjectives. We conclude by presenting evaluation methods for the task at hand, and analyzing the significance of the results obtained.
Semantic relatedness subsumes hyponymy, synonymy, and antonymy-incompatibility.
automatically grouping adjectives according to their meaning
adjective - nouns
adjectives describing the same property tend to modify approximately the same set of nouns.
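The observation that adjectives describing the same property tend to modify roughly the same nouns can be illustrated with a small sketch. Cosine similarity over adjective-noun counts is used here purely as a stand-in; it is not the similarity measure of the paper, and the counts below are invented.

```python
import math
from collections import Counter

# Hypothetical adjective -> modified-noun counts.
MODIFIED_NOUNS = {
    "hot": Counter({"water": 3, "day": 2, "soup": 1}),
    "warm": Counter({"water": 2, "day": 3, "coat": 1}),
    "tall": Counter({"man": 4, "building": 3}),
}


def cosine(u, v):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(u[k] * v[k] for k in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


hot_warm = cosine(MODIFIED_NOUNS["hot"], MODIFIED_NOUNS["warm"])
hot_tall = cosine(MODIFIED_NOUNS["hot"], MODIFIED_NOUNS["tall"])
```

With these counts "hot" and "warm" score high because they share "water" and "day", while "hot" and "tall" share no nouns and score zero.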
updated at: 2007/06/12 09:25:31
榊剛史, 松尾豊, 内山幸樹, 石塚満.
Constructing a thesaurus of related terms using information on the Web (in Japanese).
Journal of Natural Language Processing (自然言語処理), Vol. 14, No. 2, pp. 3-31,
2007.
This paper describes a method for automatically constructing thesauri of related terms from Web information. We use a Web search engine to obtain word co-occurrence information and propose a new, efficient similarity metric that applies the χ² value to solve problems with existing methods. We also introduce a new method for identifying related terms using word clustering: we cluster the associative network of words with a recent clustering method, the "Newman method". We evaluate our approach and show its effectiveness using sets of related terms extracted from a corpus and from an existing thesaurus.
updated at: 2007/05/07 18:03:29
Tags and Ontologies
神崎正英
Lecture materials for a public academic lecture at Ritsumeikan University, Biwako-Kusatsu Campus, 2007-01-17
http://www.kanzaki.com/works/2007/pub/0117ritsumei.html
An ontology is not a classified vocabulary list; it is what connects different applications.
updated at: 2007/05/07 17:52:52
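當間雅, 折原幸治, 塩入寛之, 梅村恭司.
Evaluation measures for mining related term pairs (in Japanese).
Proceedings of the 13th Annual Meeting of the Association for Natural Language Processing (言語処理学会), B3-7,
2007.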
updated at: 2007/04/16 17:47:54
Dekang Lin, Shaojun Zhao, Lijuan Qin and Ming Zhou.
Identifying Synonyms among Distributionally Similar Words.
In Proceedings of IJCAI-03, pp.1492-1493.
2003.
There have been many proposals to compute similarities between words based on their distributions in contexts. However, these approaches do not distinguish between synonyms and antonyms. We present two methods for identifying synonyms among distributionally similar words.
updated at: 2007/02/09 13:13:21
Donald Hindle.
Noun classification from predicate-argument structures.
In 28th Annual Meeting of the Association for Computational Linguistics,
pp. 268-275,
1990.
Abstract: A method of determining the similarity of nouns on the basis of a metric derived from the distribution of subject, verb and object in a large text corpus is described. The resulting quasi-semantic classification of nouns demonstrates the plausibility of the distributional hypothesis, and has potential application to a variety of tasks, including automatic indexing, resolving nominal compounds, and determining the scope of modification.
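The paper defines "reciprocal nearest neighbors" (RNN) as two nouns that are each other's most similar noun. A minimal sketch of finding such pairs follows; the similarity scores are invented and the function names are hypothetical, so this shows only the definition, not Hindle's similarity metric.

```python
# Hypothetical symmetric similarity scores between nouns.
SCORES = {
    frozenset(("wine", "beer")): 0.9,
    frozenset(("wine", "car")): 0.1,
    frozenset(("wine", "truck")): 0.05,
    frozenset(("beer", "car")): 0.2,
    frozenset(("beer", "truck")): 0.1,
    frozenset(("car", "truck")): 0.8,
    frozenset(("car", "bus")): 0.7,
    frozenset(("truck", "bus")): 0.6,
}
NOUNS = ["wine", "beer", "car", "truck", "bus"]


def similarity(a, b):
    """Look up the score of an unordered pair; unlisted pairs score 0."""
    return SCORES.get(frozenset((a, b)), 0.0)


def nearest(word, words, sim):
    """Return the word most similar to `word` among the other words."""
    others = [w for w in words if w != word]
    return max(others, key=lambda w: sim(word, w))


def reciprocal_nearest_neighbours(words, sim):
    """Pairs (a, b) where a is b's nearest neighbour and b is a's."""
    nn = {w: nearest(w, words, sim) for w in words}
    pairs = {tuple(sorted((w, nn[w]))) for w in words if nn[nn[w]] == w}
    return sorted(pairs)


rnn_pairs = reciprocal_nearest_neighbours(NOUNS, similarity)
```

Here "bus" has "car" as its nearest neighbour, but the nearest neighbour of "car" is "truck", so "bus" does not form an RNN pair.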
"the meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities." (Harris 1968:12).
"More is to be learned from the fact that you can drink wine than from the fact that you can drink it even though there are more clauses in our sample with it as an object of drink than with wine."
"We can define "reciprocally most similar" nouns or "reciprocal nearest neighbors" (RNN) as two nouns which are each other's most similar noun."
"More is to be learned from the fact that you can drink wine than from the fact that you can drink it even though there are more clauses in our sample with it as an object of drink than with wine."
"We can define "reciprocally most similar" nouns or "reciprocal nearest neighbors" (RNN) as two nouns which are each other's most similar noun."
updated at: 2007/01/22 16:19:20
Dekang Lin.
Automatic retrieval and clustering of similar words.
In Proceedings of the 17th International Conference on Computational Linguistics and of the 36th Annual Meeting of the Association for Computational Linguistics,
pp. 768-774,
1998.
Abstract: Bootstrapping semantics from text is one of the greatest challenges in natural language learning. We first define a word similarity measure based on the distributional pattern of words. The similarity measure allows us to construct a thesaurus using a parsed corpus. We then present a new evaluation methodology for the automatically constructed thesaurus. The evaluation results show that the thesaurus is significantly closer to WordNet than Roget Thesaurus is.
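The similarity measure is usually summarised as follows: each dependency triple (w, r, w') carries an information value I(w, r, w'), and two words are compared by the information in the features they share, divided by the total information in both feature sets. The sketch below follows that summary from memory and should be checked against the paper; the triple counts and function names are assumptions.

```python
import math
from collections import Counter, defaultdict

# Hypothetical dependency triple counts: (word, relation, other_word) -> count.
TRIPLES = {
    ("wine", "obj-of", "drink"): 4,
    ("wine", "obj-of", "pour"): 2,
    ("beer", "obj-of", "drink"): 3,
    ("beer", "obj-of", "brew"): 2,
    ("car", "obj-of", "drive"): 5,
    ("car", "obj-of", "wash"): 1,
}


def information(triples):
    """I(w, r, w2) = log(c(w,r,w2) * c(*,r,*) / (c(w,r,*) * c(*,r,w2))).

    Only features with positive information are kept for each word.
    """
    c_r, c_wr, c_rw2 = Counter(), Counter(), Counter()
    for (w, r, w2), n in triples.items():
        c_r[r] += n
        c_wr[(w, r)] += n
        c_rw2[(r, w2)] += n
    info = defaultdict(dict)
    for (w, r, w2), n in triples.items():
        value = math.log(n * c_r[r] / (c_wr[(w, r)] * c_rw2[(r, w2)]))
        if value > 0:
            info[w][(r, w2)] = value
    return info


def lin_similarity(info, w1, w2):
    """Shared-feature information over total feature information."""
    f1, f2 = info.get(w1, {}), info.get(w2, {})
    shared = f1.keys() & f2.keys()
    numerator = sum(f1[f] + f2[f] for f in shared)
    denominator = sum(f1.values()) + sum(f2.values())
    return numerator / denominator if denominator else 0.0


INFO = information(TRIPLES)
wine_beer = lin_similarity(INFO, "wine", "beer")
wine_car = lin_similarity(INFO, "wine", "car")
```

A word compared with itself scores 1, and words with no shared features score 0, so the measure stays within [0, 1].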
"It was shown in (Dagan et al., 1997) that a similarity-based smoothing
method achieved much better results than backoff smoothing methods in
word sense disambiguation."
"The differences between Hindle and Hindle_r clearly demonstrate that
the use of other types of dependencies in addition to subject and
object relationships is very beneficial."
updated at: 2007/01/20 17:33:48
Fernando Pereira, Naftali Tishby, and Lillian Lee.
Distributional clustering of English words.
In Proceedings of the 31st annual meeting of the Association for Computational Linguistics,
pp. 183-190,
1993.
Abstract: We describe and evaluate experimentally a method for clustering words according to their distribution in particular syntactic contexts. Words are represented by the relative frequency distributions of contexts in which they appear, and relative entropy between those distributions is used as the similarity measure for clustering. Clusters are represented by average context distributions derived from the given words according to their probabilities of cluster membership. In many cases, the clusters can be thought of as encoding coarse sense distinctions. Deterministic annealing is used to find lowest distortion sets of clusters: as the annealing parameter increases, existing clusters become unstable and subdivide, yielding a hierarchical "soft" clustering of the data. Clusters are used as the basis for class models of word co-occurrence, and the models evaluated with respect to held-out test data.
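The two ingredients named in the abstract, relative frequency distributions over contexts and relative entropy between them, can be sketched as below. The verb counts are invented, and the centroid here is a plain unweighted average, which simplifies the membership-weighted averages described in the abstract; the deterministic annealing procedure is not shown.

```python
import math

# Hypothetical noun -> verb (context) counts.
CONTEXT_COUNTS = {
    "wine": {"drink": 6, "pour": 2},
    "beer": {"drink": 5, "brew": 3},
}


def relative_frequencies(counts):
    """Turn raw context counts into a probability distribution."""
    total = sum(counts.values())
    return {context: n / total for context, n in counts.items()}


def kl_divergence(p, q):
    """Relative entropy D(p || q) in nats.

    Returns infinity when q gives zero mass to a context that p uses.
    """
    total = 0.0
    for context, p_c in p.items():
        if p_c == 0:
            continue
        q_c = q.get(context, 0.0)
        if q_c == 0:
            return math.inf
        total += p_c * math.log(p_c / q_c)
    return total


def centroid(distributions):
    """Unweighted average of distributions (a simplifying assumption)."""
    contexts = set().union(*distributions)
    size = len(distributions)
    return {c: sum(d.get(c, 0.0) for d in distributions) / size for c in contexts}


p_wine = relative_frequencies(CONTEXT_COUNTS["wine"])
p_beer = relative_frequencies(CONTEXT_COUNTS["beer"])
cluster = centroid([p_wine, p_beer])
wine_to_cluster = kl_divergence(p_wine, cluster)
wine_to_beer = kl_divergence(p_wine, p_beer)
```

Comparing a word against a cluster average rather than against another word keeps the divergence finite, because the average has mass wherever any member does; the direct word-to-word divergence above is infinite since "beer" never occurs with "pour".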
"the relation between a transitive main verb and the head noun of its direct object."
parsed by Hindle's parser Fidditch
<<<BOOKMARK>>> read till "Distributional Similarity" on page 2.
updated at: 2007/01/20 17:10:37