xcit'ed

- paper management system

by matton

tag search result for 'word'

Pablo Gamallo, Caroline Gasperin, Alexandre Agustini, and Gabriel P. Lopes
Syntactic-Based Methods for Measuring Word Similarity
MAUTNER V., MOUCEK R., MOUCEK K., Eds., Text, Speech, and Discourse (TSD-2001),
p. 116--125,
Berlin:Springer Verlag, 2001.
Abstract. This paper explores different strategies for extracting similarity relations between words from partially parsed text corpora. The strategies we have analysed do not require supervised training nor semantic information available from general lexical resources. They differ in the amount and the quality of the syntactic contexts against which words are compared. The paper presents in detail the notion of syntactic context and how syntactic information could be used to extract semantic regularities of word sequences. Finally, experimental tests with a Portuguese corpus demonstrate that similarity measures based on fine-grained and elaborate syntactic contexts perform better than those based on poorly defined contexts.
updated at: 2007/06/12 11:04:03
Kenneth Ward Church, Patrick Hanks
Word Association Norms, Mutual Information, and Lexicography
Computational Linguistics 16(1): 22-29.
1990.
The term word association is used in a very particular sense in the psycholinguistic literature. (Generally speaking, subjects respond quicker than normal to the word "nurse" if it follows a highly associated word such as "doctor.") We will extend the term to provide the basis for a statistical description of a variety of interesting linguistic phenomena, ranging from semantic relations of the doctor/nurse type (content word/content word) to lexico-syntactic co-occurrence constraints between
Smaller window sizes will identify fixed expressions (idioms) and other relations that hold over short ranges; larger window sizes will highlight semantic concepts and other relationships that hold over larger scales.

(asymmetry) f(x, y) \neq f(y, x) because f(x, y) encodes linear precedence.
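
The window-size and asymmetry notes above can be illustrated with a minimal sketch of an association ratio of the form log2(P(x, y) / (P(x) P(y))). It assumes that f(x, y) counts how often y occurs within `window` tokens after x and that probabilities are plain counts divided by the corpus size; the normalization in the paper may differ. The function names and the toy sentence are made up for illustration.

```python
import math
from collections import Counter


def ordered_pair_counts(tokens, window):
    """Count f(x, y): how often y occurs within `window` tokens AFTER x."""
    pairs = Counter()
    for i, x in enumerate(tokens):
        for y in tokens[i + 1 : i + 1 + window]:
            pairs[(x, y)] += 1
    return pairs


def association_ratio(tokens, x, y, window=5):
    """log2(P(x, y) / (P(x) * P(y))), all probabilities estimated as count / N."""
    n = len(tokens)
    unigrams = Counter(tokens)
    f_xy = ordered_pair_counts(tokens, window)[(x, y)]
    if f_xy == 0:
        return float("-inf")  # the ordered pair was never observed
    return math.log2((f_xy / n) / ((unigrams[x] / n) * (unigrams[y] / n)))


# Toy corpus (hypothetical): "doctor" tends to precede "nurse".
tokens = "the doctor called the nurse and the doctor thanked the nurse".split()
forward = association_ratio(tokens, "doctor", "nurse", window=3)
backward = association_ratio(tokens, "nurse", "doctor", window=3)
```

Because the pair counts are ordered, `forward` and `backward` differ on this toy corpus, which is the asymmetry f(x, y) \neq f(y, x) noted above. Changing `window` changes which co-occurrences are counted at all.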
updated at: 2007/06/11 12:30:23
James R. Curran and Marc Moens.
Scaling Context Space.
In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL),
pp. 231-238,
2002.
Abstract: Context is used in many NLP systems as an indicator of a term's syntactic and semantic function. The accuracy of the system is dependent on the quality and quantity of contextual information available to describe each term. However, the quantity variable is no longer fixed by limited corpus resources. Given fixed training time and computational resources, it makes sense for systems to invest time in extracting high quality contextual information from a fixed corpus. However, with an effectively limitless quantity of text available, extraction rate and representation size need to be considered. We use thesaurus extraction with a range of context extracting tools to demonstrate the interaction between context quantity, time and size on a corpus of 300 million words.
corpus size is no longer a limiting factor
W(L1R1), W(L12) give reasonable results
log-linear relation between corpus size and performance
"It is a phenomenon common to many NLP tasks that the quality or accuracy of a system increases loglinearly with the size of the corpus."
"it could well be that far simpler but scalable learning algorithms significantly outperform existing systems."

used a 300M-word corpus! (cf. WordBank = 3.5M)
up to now people have typically worked with corpora of around one million words (up to one billion!)
thesaurus extraction is a task where success has been limited when using small corpora
updated at: 2007/05/13 09:53:53
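榊剛史, 松尾豊, 内山幸樹, 石塚満.
Construction of a Thesaurus of Related Terms Using Information on the Web (Web上の情報を用いた関連語のシソーラス構築について).
Journal of Natural Language Processing (自然言語処理), Vol. 14, No. 2, pp. 3-31,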
2007.

This paper describes a method for automatically constructing thesauri of related terms from Web information. We use a Web search engine to obtain word co-occurrence information and propose a new, efficient similarity metric that applies the \chi^2 value to solve problems with existing methods. We also introduce a new method for identifying related terms by word clustering: we cluster the words of the associative network using a recent clustering method, the "Newman method". We evaluate the approach on sets of related terms extracted from a corpus and from an existing thesaurus, and show its effectiveness.
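
As a rough illustration of applying a \chi^2 value to search-engine co-occurrence counts, here is a minimal sketch of the standard Pearson chi-square statistic for a 2x2 contingency table. This is the textbook statistic, not necessarily the metric proposed in the paper. The hit counts and the total page count are hypothetical inputs; no real search API is called.

```python
def chi_square_2x2(both, only_x, only_y, neither):
    """Pearson chi-square for a 2x2 table of page counts.

    both    : pages containing x and y
    only_x  : pages containing x but not y
    only_y  : pages containing y but not x
    neither : pages containing neither word
    """
    n = both + only_x + only_y + neither
    numerator = n * (both * neither - only_x * only_y) ** 2
    denominator = (
        (both + only_x) * (only_y + neither) * (both + only_y) * (only_x + neither)
    )
    return numerator / denominator if denominator else 0.0


def chi_square_from_hits(hits_x, hits_y, hits_xy, total_pages):
    """Build the 2x2 table from (hypothetical) hit counts and score it."""
    both = hits_xy
    only_x = hits_x - hits_xy
    only_y = hits_y - hits_xy
    neither = total_pages - hits_x - hits_y + hits_xy
    return chi_square_2x2(both, only_x, only_y, neither)


# Hypothetical counts: 100 hits each, 50 joint hits, 1000 pages in total.
score = chi_square_from_hits(hits_x=100, hits_y=100, hits_xy=50, total_pages=1000)
```

A score of zero means the two words co-occur exactly as often as independence would predict. Larger values indicate a stronger departure from independence, but the statistic alone does not say whether the association is positive or negative.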
updated at: 2007/05/07 18:03:29
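當間雅, 折原幸治, 塩入寛之, 梅村恭司.
Evaluation Measures for Mining Related Word Pairs (関連語対のマイニングのための評価尺度).
Proceedings of the 13th Annual Meeting of the Association for Natural Language Processing (言語処理学会第13回年次大会予稿集), B3-7,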
2007.
updated at: 2007/04/16 17:47:54
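工藤拓 (NAIST), 山本薫 (RIKEN), 松本裕治 (NAIST)
Japanese Morphological Analysis Using Conditional Random Fields (Conditional Random Fields を用いた日本語形態素解析)
161st Meeting of the SIG on Natural Language Processing (第161回 自然言語処理研究会)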
2005
This paper presents Japanese morphological analysis based on Conditional Random Fields (CRF). Previous work in CRF assumed that observation sequence (word) boundaries were fixed. However, word boundaries are not clear in Japanese, and hence a straightforward application of CRF is not possible. We show how CRF can be applied to situations where word boundary ambiguity exists. CRF offers an elegant solution to the long-standing problems in Japanese morphological analysis using HMM or MEMM. First, flexible feature designs for hierarchical tagsets become possible. Second, influences of label and length bias are minimized. The former compensates for weaknesses in HMM, while the latter overcomes known problems in MEMM. We experiment with CRF, HMM, and MEMM on Japanese annotated corpora, and CRF outperforms the other approaches.
updated at: 2007/03/09 08:57:57
Dekang Lin, Shaojun Zhao, Lijuan Qin and Ming Zhou.
Identifying Synonyms among Distributionally Similar Words.
In Proceedings of IJCAI-03, pp.1492-1493.
2003.
There have been many proposals to compute similarities between words based on their distributions in contexts. However, these approaches do not distinguish between synonyms and antonyms. We present two methods for identifying synonyms among distributionally similar words.
updated at: 2007/02/09 13:13:21
寺田昭, 吉田稔, 中川裕志
A Tool for Supporting Synonym Dictionary Construction Using Contextual Information (文脈情報による同義語辞書作成支援ツール)
IPSJ SIG Technical Report, NL176, pp. 87-94,
2006.
To improve the performance of text processing such as information retrieval or text mining, a synonym dictionary is needed, but building one by hand is very tedious. In some fields, such as aviation, synonymous nouns are written in a mixture of kanji/hiragana, katakana, alphabet, and their abbreviations. Because new words keep coming into use, updating the dictionary is a major issue. In this paper, we propose a tool for constructing a synonym dictionary. The system returns synonym candidates for a query. A synonym can easily be registered in the dictionary by looking at the synonym candidates. We tested the system on aviation pilot reports and evaluated it by average precision.
"frequency is sometimes adjusted as log(x_i + 1)" -> effective
window[2,2] was the best
spiral construction -> not better
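
The first two notes above (log(x_i + 1) damping and a [2,2] window) can be sketched as follows. This is a minimal illustration, assuming that window[2,2] means two tokens to the left and two to the right of the target. Cosine similarity is used here only as a stand-in comparison, since the notes do not say which similarity measure the tool uses. All function names are made up.

```python
import math
from collections import Counter


def window_contexts(tokens, target, left=2, right=2):
    """Count the words seen within `left`/`right` tokens of each `target` occurrence."""
    counts = Counter()
    for i, tok in enumerate(tokens):
        if tok != target:
            continue
        start = max(0, i - left)
        stop = min(len(tokens), i + right + 1)
        for j in range(start, stop):
            if j != i:
                counts[tokens[j]] += 1
    return counts


def log_damp(counts):
    """Replace each raw frequency x_i with log(x_i + 1)."""
    return {word: math.log(count + 1) for word, count in counts.items()}


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(u[w] * v[w] for w in u.keys() & v.keys())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


# Toy example: "x" and "y" appear in identical [2,2] contexts.
tokens = "a b x c d a b y c d".split()
vec_x = log_damp(window_contexts(tokens, "x"))
vec_y = log_damp(window_contexts(tokens, "y"))
similarity = cosine(vec_x, vec_y)
```

The damping step keeps very frequent context words from dominating the vector, which is one plausible reading of why the adjustment was found effective.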
updated at: 2007/01/23 10:01:54
James Curran.
From Distributional to Semantic Similarity.
PhD thesis, University of Edinburgh,
2004.

<<BOOKMARK>> read only chapter 3.

Landauer and Dumais (1997) -> argue that a 500 "character" limit is more appropriate.
"a fixed character window will select either fewer longer (and thus more informative) words or more shorter (and thus less informative) words, extracting a consistent amount of contextual information for each headword"
updated at: 2007/01/20 17:36:44
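
The fixed character window quoted in the Curran thesis notes above can be sketched as follows. This is a minimal illustration, assuming the character budget applies separately to each side of the headword and that each context word costs its length plus one separating space. Neither detail is given in the notes, and the function name is made up.

```python
def char_window(tokens, index, budget):
    """Return (left, right) context words of tokens[index] that fit in
    `budget` characters per side, walking outward from the headword."""

    def collect(indices):
        used, picked = 0, []
        for j in indices:
            cost = len(tokens[j]) + 1  # the word plus one separating space
            if used + cost > budget:
                break
            used += cost
            picked.append(tokens[j])
        return picked

    left = collect(range(index - 1, -1, -1))
    right = collect(range(index + 1, len(tokens)))
    return left[::-1], right  # restore reading order on the left side


tokens = ["a", "of", "the", "bank", "internationalization", "is"]
small = char_window(tokens, 3, budget=10)
large = char_window(tokens, 3, budget=25)
```

With a budget of 10 the left side admits three short words while the right side admits none, because the single long word does not fit. This matches the trade-off in the quotation: fewer long words or more short words.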
Dekang Lin.
Automatic retrieval and clustering of similar words.
In Proceedings of the 17th International Conference on Computational Linguistics and of the 36th Annual Meeting of the Association for Computational Linguistics,
pp. 768-774,
1998.
Abstract: Bootstrapping semantics from text is one of the greatest challenges in natural language learning. We first define a word similarity measure based on the distributional pattern of words. The similarity measure allows us to construct a thesaurus using a parsed corpus. We then present a new evaluation methodology for the automatically constructed thesaurus. The evaluation results show that the thesaurus is significantly closer to WordNet than Roget Thesaurus is.
"It was shown in (Dagan et al., 1997) that a similarity-based smoothing
method achieved much better results than backoff smoothing methods in
word sense disambiguation."

"The differences between Hindle and Hindle_r clearly demonstrate that
the use of other types of dependencies in addition to subject and
object relationships is very beneficial."
updated at: 2007/01/20 17:33:48
Fernando Pereira, Naftali Tishby, and Lillian Lee.
Distributional clustering of English words.
In Proceedings of the 31st annual meeting of the Association for Computational Linguistics,
pp. 183-190,
1993.
Abstract: We describe and evaluate experimentally a method for clustering words according to their distribution in particular syntactic contexts. Words are represented by the relative frequency distributions of contexts in which they appear, and relative entropy between those distributions is used as the similarity measure for clustering. Clusters are represented by average context distributions derived from the given words according to their probabilities of cluster membership. In many cases, the clusters can be thought of as encoding coarse sense distinctions. Deterministic annealing is used to find lowest distortion sets of clusters: as the annealing parameter increases, existing clusters become unstable and subdivide, yielding a hierarchical "soft" clustering of the data. Clusters are used as the basis for class models of word cooccurrence, and the models evaluated with respect to held-out test data.
"the relation between a transitive main verb and the head noun of its direct object."
parsed by Hindle's parser Fidditch
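
The representation described in the abstract (relative frequency distributions over contexts, compared by relative entropy) can be sketched as below. This is a minimal illustration only. The verb counts are hypothetical, the function names are made up, and the clustering and deterministic annealing steps are not shown.

```python
import math


def relative_frequencies(counts):
    """Turn raw context counts into a relative frequency distribution."""
    total = sum(counts.values())
    return {context: n / total for context, n in counts.items()}


def relative_entropy(p, q):
    """D(p || q) = sum_c p(c) * log(p(c) / q(c)).

    Returns infinity when q gives zero probability to a context that p uses.
    """
    total = 0.0
    for context, pc in p.items():
        if pc == 0:
            continue
        qc = q.get(context, 0.0)
        if qc == 0:
            return math.inf
        total += pc * math.log(pc / qc)
    return total


# Hypothetical counts of verbs taking each noun as direct object.
wine = relative_frequencies({"drink": 3, "pour": 1})
water = relative_frequencies({"drink": 1, "pour": 1})
divergence = relative_entropy(wine, water)
```

Relative entropy is asymmetric, and it is infinite when the second distribution lacks a context that the first one uses. Comparing a word against an averaged cluster distribution, as the abstract describes, makes that case less likely, because the average covers the contexts of the words it was derived from.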

<<<BOOKMARK>>> read till "Distributional Similarity" on page 2.
updated at: 2007/01/20 17:10:37