Tag search results for 'model'
James R. Curran and Marc Moens.
Improvements in Automatic Thesaurus Extraction.
In Proceedings of the Workshop of the ACL Special Interest Group on the Lexicon (SIGLEX),
pp. 59-66,
2002.
Abstract: The use of semantic resources is common in modern NLP systems, but methods to extract lexical semantics have only recently begun to perform well enough for practical use. We evaluate existing and new similarity metrics for thesaurus extraction, and experiment with the tradeoff between extraction performance and efficiency. We propose an approximation algorithm, based on canonical attributes and coarse- and fine-grained matching, that reduces the time complexity and execution time of thesaurus extraction with only a marginal performance penalty.
thesaurus extraction systems -> differ in the definition of "context"
used a statistical shallow parser
frequency cutoff speeds up the calculation, but doesn't decrease the performance
misc. topics: weights, measures, cutoff frequency, speed-up by canonical vectors
canonical vectors: subj+dobj+iobj, TTestLog + maximum frequency cutoff
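The cutoff and weighting notes above can be sketched roughly as follows. This is only an illustration: the toy counts, the function names, and the choice to apply a minimum cutoff to each term's total frequency are assumptions, and the weight shown is one common t-test style formulation, (p(w,c) - p(w)p(c)) / sqrt(p(w)p(c)), which may not match the paper's TTest or TTestLog exactly.

```python
import math
from collections import Counter

def apply_min_cutoff(counts, min_freq):
    """Keep only (term, context) pairs whose term occurs at least min_freq times."""
    term_totals = Counter()
    for (term, _context), n in counts.items():
        term_totals[term] += n
    return {pair: n for pair, n in counts.items() if term_totals[pair[0]] >= min_freq}

def ttest_weights(counts):
    """Weight each pair by (p(w,c) - p(w)p(c)) / sqrt(p(w)p(c)) (assumed form)."""
    total = sum(counts.values())
    term_totals, context_totals = Counter(), Counter()
    for (term, context), n in counts.items():
        term_totals[term] += n
        context_totals[context] += n
    weights = {}
    for (term, context), n in counts.items():
        p_joint = n / total
        p_indep = (term_totals[term] / total) * (context_totals[context] / total)
        weights[(term, context)] = (p_joint - p_indep) / math.sqrt(p_indep)
    return weights

# Hypothetical toy counts of (term, (relation, word)) pairs.
counts = {
    ("cat", ("subj", "purr")): 4,
    ("cat", ("dobj", "feed")): 3,
    ("dog", ("dobj", "feed")): 5,
    ("axolotl", ("dobj", "feed")): 1,
}

filtered = apply_min_cutoff(counts, min_freq=2)  # drops the rare term "axolotl"
weights = ttest_weights(filtered)
```

Fewer surviving terms means fewer pairwise vector comparisons, which is presumably where the speed-up noted above comes from.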
updated at: 2007/07/07 17:25:42
G. Salton, C.S. Yang, and C.T. Yu
A Theory of Term Importance in Automatic Text Analysis
updated at: 2007/06/14 10:56:34
Michael Collins.
Head-driven Statistical Models for Natural Language Parsing
A Dissertation in Computer and Information Science,
1999.
Statistical models for parsing natural language have recently shown considerable success in broad-coverage domains. Ambiguity often leads to an input sentence having many possible parse trees; statistical approaches assign a probability to each tree, thereby ranking competing trees in order of plausibility. The probability for each candidate tree is calculated as a product of terms, each term corresponding to some sub-structure within the tree. The choice of parameterization is the choice of how to break down the tree. There are two critical questions regarding the parameterization of the problem:
updated at: 2007/06/04 16:22:07
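The product-of-terms idea in the Collins abstract above can be illustrated with a plain PCFG, where each term is the probability of one rule used in the tree. This is a deliberately simplified stand-in, not the head-driven parameterization of the dissertation; the grammar, its probabilities, and the tuple encoding of trees are all made up for the example.

```python
import math

# Hypothetical toy rule probabilities: (parent, children) -> P(children | parent).
RULE_PROB = {
    ("S", ("NP", "VP")): 1.0,
    ("NP", ("she",)): 0.4,
    ("NP", ("fish",)): 0.6,
    ("VP", ("V", "NP")): 1.0,
    ("V", ("eats",)): 1.0,
}

def tree_log_prob(tree):
    """Sum the log-probabilities of every rule used in the tree.

    A tree is a tuple (label, child, ...); each child is a subtree or a word.
    """
    label, *children = tree
    rhs = tuple(child if isinstance(child, str) else child[0] for child in children)
    logp = math.log(RULE_PROB[(label, rhs)])
    for child in children:
        if not isinstance(child, str):
            logp += tree_log_prob(child)
    return logp

tree = ("S", ("NP", "she"), ("VP", ("V", "eats"), ("NP", "fish")))
prob = math.exp(tree_log_prob(tree))  # 1.0 * 0.4 * 1.0 * 1.0 * 0.6, about 0.24
```

Ranking competing parses then amounts to sorting candidate trees by this score; working in log space avoids underflow on larger trees.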
James R. Curran and Marc Moens.
Scaling Context Space.
In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL),
pp. 231-238,
2002.
Abstract: Context is used in many NLP systems as an indicator of a term's syntactic and semantic function. The accuracy of the system is dependent on the quality and quantity of contextual information available to describe each term. However, the quantity variable is no longer fixed by limited corpus resources. Given fixed training time and computational resources, it makes sense for systems to invest time in extracting high quality contextual information from a fixed corpus. However, with an effectively limitless quantity of text available, extraction rate and representation size need to be considered. We use thesaurus extraction with a range of context extracting tools to demonstrate the interaction between context quantity, time and size on a corpus of 300 million words.
corpus size is no longer a limiting factor
W(L1R1), W(L12) give reasonable results
log-linear relation between corpus size and performance
"It is a phenomenon common to many NLP tasks that the quality or accuracy of a system increases loglinearly with the size of the corpus."
"it could well be that far simpler but scalable learning algorithms significantly outperform existing systems."
used a 300M-word corpus! (cf. WordBank = 3.5M)
up to now people have typically worked with corpora of around one million words (up to one billion!)
thesaurus extraction is a task where success has been limited when using small corpora
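To make the log-linear note above concrete, here is a small least-squares fit of score = intercept + slope * log10(corpus size). The corpus sizes and scores are synthetic numbers invented for the illustration, not results from the paper, and the function name is hypothetical.

```python
import math

def fit_log_linear(sizes, scores):
    """Ordinary least squares for score = intercept + slope * log10(size)."""
    xs = [math.log10(size) for size in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Synthetic data: each tenfold increase in corpus size adds 0.05 to the score.
sizes = [1e6, 1e7, 1e8, 1e9]
scores = [0.20, 0.25, 0.30, 0.35]

intercept, slope = fit_log_linear(sizes, scores)
```

Under such a relation every constant gain in score costs ten times more text, which is why extraction rate and representation size become the practical constraints.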
updated at: 2007/05/13 09:53:53
寺田昭, 吉田稔, 中川裕志
A Tool for Supporting Synonym Dictionary Construction Using Context Information (文脈情報による同義語辞書作成支援ツール)
IPSJ SIG Technical Report, NL176, pp. 87-94,
2006.
To improve the performance of text processing such as information retrieval or text mining, it is necessary to construct a synonym dictionary, but building one by hand is very tedious. In some fields, such as aviation, synonymous nouns appear in a mix of kanji/hiragana, katakana, alphabetic forms, and their abbreviations. Because new words continually come into use, updating the dictionary is a major issue. In this paper, we propose a tool for constructing a synonym dictionary. The system returns synonym candidates for a query, and a synonym can easily be registered in the dictionary by looking through the candidates. We tested the system on aviation pilot reports and evaluated it by average precision.
"frequency is sometimes adjusted as log(x_i + 1)" -> effective
window[2,2] was the best
spiral construction -> not better
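A minimal sketch of the window-based context vectors and the log(x_i + 1) adjustment noted above. It assumes that window[2,2] means two tokens to the left and two to the right of the target, and it uses cosine as the similarity measure purely for illustration; the sentence and all function names are hypothetical.

```python
import math
from collections import Counter, defaultdict

def window_contexts(tokens, left=2, right=2):
    """Count, for each token, the words seen within the given window around it."""
    vectors = defaultdict(Counter)
    for i, term in enumerate(tokens):
        start = max(0, i - left)
        end = min(len(tokens), i + right + 1)
        for j in range(start, end):
            if j != i:
                vectors[term][tokens[j]] += 1
    return vectors

def damp(vector):
    """Replace each raw frequency x with log(x + 1)."""
    return {context: math.log(n + 1) for context, n in vector.items()}

def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(key, 0.0) for key, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

tokens = "the pilot reported heavy turbulence the captain reported heavy turbulence".split()
vectors = window_contexts(tokens, left=2, right=2)
similarity = cosine(damp(vectors["pilot"]), damp(vectors["captain"]))
```

Candidates for a query term would then be the other terms ranked by this similarity; the log damping keeps a few very frequent context words from dominating the score.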
updated at: 2007/01/23 10:01:54
Fernando Pereira, Naftali Tishby, and Lillian Lee.
Distributional clustering of English words.
In Proceedings of the 31st annual meeting of the Association for Computational Linguistics,
pp. 183-190,
1993.
Abstract: We describe and evaluate experimentally a method for clustering words according to their distribution in particular syntactic contexts. Words are represented by the relative frequency distributions of contexts in which they appear, and relative entropy between those distributions is used as the similarity measure for clustering. Clusters are represented by average context distributions derived from the given words according to their probabilities of cluster membership. In many cases, the clusters can be thought of as encoding coarse sense distinctions. Deterministic annealing is used to find lowest distortion sets of clusters: as the annealing parameter increases, existing clusters become unstable and subdivide, yielding a hierarchical "soft" clustering of the data. Clusters are used as the basis for class models of word cooccurrence, and the models evaluated with respect to held-out test data.
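The representation described in the abstract can be sketched as below: each noun becomes a relative frequency distribution over the verbs it occurs with, a cluster is a membership-weighted average of such distributions, and relative entropy (KL divergence) measures how far a word is from a cluster. The counts and the equal membership weights are invented, and the deterministic annealing procedure itself is not shown.

```python
import math

def relative_freq(counts):
    """Turn raw context counts into a relative frequency distribution."""
    total = sum(counts.values())
    return {context: n / total for context, n in counts.items()}

def kl_divergence(p, q):
    """Relative entropy D(p || q); assumes q is nonzero wherever p is nonzero."""
    return sum(pk * math.log(pk / q[k]) for k, pk in p.items() if pk > 0)

def centroid(dists, memberships):
    """Average the distributions, weighted by cluster membership probabilities."""
    keys = set().union(*dists)
    norm = sum(memberships)
    return {
        k: sum(m * d.get(k, 0.0) for d, m in zip(dists, memberships)) / norm
        for k in keys
    }

# Hypothetical counts of verbs taking each noun as direct object.
wine = relative_freq({"drink": 8, "pour": 2})
beer = relative_freq({"drink": 6, "pour": 4})

cluster = centroid([wine, beer], memberships=[0.5, 0.5])
distance = kl_divergence(wine, cluster)
```

Because the centroid averages over its members, it is nonzero wherever any member is, so the divergence from a member word to its cluster stays finite.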
"the relation between a transitive main verb and the head noun of its direct object."
parsed by Hindle's parser Fidditch
<<<BOOKMARK>>> read till "Distributional Similarity" on page 2.
updated at: 2007/01/20 17:10:37