xcit'ed

- paper management system


by matton

tag search result for 'contextual'

Pablo Gamallo, Caroline Gasperin, Alexandre Agustini, and Gabriel P. Lopes
Syntactic-Based Methods for Measuring Word Similarity
MAUTNER V., MOUCEK R., MOUCEK K., Eds., Text, Speech and Dialogue (TSD-2001),
p. 116--125,
Berlin:Springer Verlag, 2001.
Abstract. This paper explores different strategies for extracting similarity relations between words from partially parsed text corpora. The strategies we have analysed do not require supervised training nor semantic information available from general lexical resources. They differ in the amount and the quality of the syntactic contexts against which words are compared. The paper presents in details the notion of syntactic context and how syntactic information could be used to extract semantic regularities of word sequences. Finally, experimental tests with Portuguese corpus demonstrate that similarity measures based on fine-grained and elaborate syntactic contexts perform better than those based on poorly defined contexts.
updated at: 2007/06/12 11:04:03
James R. Curran and Marc Moens.
Scaling Context Space.
In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics (ACL),
pp. 231-238,
2002.
Abstract: Context is used in many NLP systems as an indicator of a term's syntactic and semantic function. The accuracy of the system is dependent on the quality and quantity of contextual information available to describe each term. However, the quantity variable is no longer fixed by limited corpus resources. Given fixed training time and computational resources, it makes sense for systems to invest time in extracting high quality contextual information from a fixed corpus. However, with an effectively limitless quantity of text available, extraction rate and representation size need to be considered. We use thesaurus extraction with a range of context extracting tools to demonstrate the interaction between context quantity, time and size on a corpus of 300 million words.
corpus size is no longer a limiting factor
W(L1R1), W(L12) give reasonable results
log-linear relation between corpus size and performance
"It is a phenomenon common to many NLP tasks that the quality or accuracy of a system increases loglinearly with the size of the corpus."
"it could well be that far simpler but scalable learning algorithms significantly outperform existing systems."
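note to self: one way to check the log-linear claim on my own runs is to fit score = a + b * log10(corpus size) by least squares. A minimal sketch; `fit_loglinear` is a hypothetical helper name, and the sizes and scores below are made-up numbers for illustration only, not results from the paper.

```python
import math


def fit_loglinear(sizes, scores):
    """Least-squares fit of score = a + b * log10(size). Returns (a, b)."""
    xs = [math.log10(n) for n in sizes]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(scores) / n
    # Standard closed-form simple linear regression on the log-transformed sizes.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, scores))
    b = sxy / sxx
    a = mean_y - b * mean_x
    return a, b


# Made-up values for illustration only (NOT taken from Curran and Moens).
sizes = [1_000_000, 10_000_000, 100_000_000]
scores = [0.10, 0.15, 0.20]
a, b = fit_loglinear(sizes, scores)
```

If the relation holds, b is the gain in score for every tenfold increase in corpus size.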

used a 300M-word corpus! (cf. WordBank = 3.5M)
up to now people have typically worked with corpora of around one million words (up to one billion!)
thesaurus extraction is a task where success has been limited when using small corpora
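sketch of a window extractor in the spirit of W(L1R1), assuming the notation means one word to the left and one to the right of the headword, with the position kept as part of the context (my reading of the notation, not checked against the paper's tools). `window_contexts` is a hypothetical name.

```python
from collections import Counter, defaultdict


def window_contexts(tokens, left=1, right=1):
    """Count (position, word) contexts for every headword in a token list.

    Assumption: positions are labelled L1, L2, ... and R1, R2, ... by distance.
    """
    contexts = defaultdict(Counter)
    for i, head in enumerate(tokens):
        for d in range(1, left + 1):
            if i - d >= 0:
                contexts[head][(f"L{d}", tokens[i - d])] += 1
        for d in range(1, right + 1):
            if i + d < len(tokens):
                contexts[head][(f"R{d}", tokens[i + d])] += 1
    return contexts


tokens = "the cat sat on the mat".split()
ctx = window_contexts(tokens)  # left=1, right=1, i.e. the assumed W(L1R1)
```

Each headword ends up with a sparse context vector; thesaurus extraction would then compare these vectors with some similarity measure, which is left out here.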
updated at: 2007/05/13 09:53:53
James Curran.
From Distributional to Semantic Similarity.
PhD thesis, University of Edinburgh,
2004.

<<BOOKMARK>> read only chapter 3.

Landauer and Dumais (1997) -> argue that a 500 "character" limit is more appropriate.
"a fixed character window will select either fewer longer (and thus more informative) words or more shorter (and thus less informative) words, extracting a consistent amount of contextual information for each headword"
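sketch of a fixed character window, to see how it differs from a fixed word window. Assumptions of mine, not from the thesis: the character budget is split evenly on both sides of the headword, and only words lying entirely inside the window count. `char_window_context` is a hypothetical name.

```python
import re


def char_window_context(text, head_start, head_end, limit=500):
    """Return the words lying entirely within `limit` characters around a headword.

    Assumption: the budget is split evenly, limit // 2 characters per side.
    """
    half = limit // 2
    lo = max(0, head_start - half)
    hi = min(len(text), head_end + half)
    words = []
    for m in re.finditer(r"\w+", text):
        inside = m.start() >= lo and m.end() <= hi
        # Skip anything overlapping the headword itself.
        outside_head = m.end() <= head_start or m.start() >= head_end
        if inside and outside_head:
            words.append(m.group())
    return words


text = "a bb ccc HEAD dddd ee f"
start = text.index("HEAD")
narrow = char_window_context(text, start, start + len("HEAD"), limit=10)
wide = char_window_context(text, start, start + len("HEAD"), limit=20)
```

With a small budget only the long neighbours fit; with a larger one the short words further out are picked up too, which is the trade-off the quote describes.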
updated at: 2007/01/20 17:36:44