One Sense per Collocation

Author

David Yarowsky

Abstract

Previous work [Gale, Church and Yarowsky, 1992] showed that with high probability a polysemous word has one sense per discourse. In this paper we show that for certain definitions of collocation, a polysemous word exhibits essentially only one sense per collocation. We test this empirical hypothesis for several definitions of sense and collocation, and discover that it holds with 90-99% accuracy for binary ambiguities. We utilize this property in a disambiguation algorithm that achieves precision of 92% using combined models of very local context.

1. INTRODUCTION

The use of collocations to resolve lexical ambiguities is certainly not a new idea. The first approaches to sense disambiguation, such as [Kelly and Stone 1975], were based on simple hand-built decision tables consisting almost exclusively of questions about observed word associations in specific positions. Later work from the AI community relied heavily upon selectional restrictions for verbs, although primarily in terms of features exhibited by their arguments (such as +DRINKABLE) rather than in terms of individual words or word classes. More recent work [Brown et al. 1991][Hearst 1991] has utilized a set of discrete local questions (such as word-to-the-right) in the development of statistical decision procedures.

However, a strong trend in recent years is to treat a reasonably wide context window as an unordered bag of independent evidence points. This technique from information retrieval has been used in neural networks, Bayesian discriminators, and dictionary definition matching. In a comparative paper in this volume [Leacock et al. 1993], all three methods under investigation used words in wide context as a pool of evidence independent of relative position. It is perhaps not a coincidence that this work has focused almost exclusively on nouns, as will be shown in Section 6.2.
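The contrast drawn above, between position-specific local questions and an unordered wide-context bag, can be sketched as two feature extractors. This is only an illustration and not code from the paper: the function names, the default window size, and the example sentence are all assumptions.

```python
from collections import Counter


def local_features(tokens, i):
    """Position-specific evidence: the words immediately left and right of tokens[i]."""
    left = tokens[i - 1] if i > 0 else None
    right = tokens[i + 1] if i + 1 < len(tokens) else None
    return {"word-to-the-left": left, "word-to-the-right": right}


def bag_features(tokens, i, window=50):
    """Wide-context evidence: counts of words within +/- window, position ignored."""
    start = max(0, i - window)
    context = tokens[start:i] + tokens[i + 1:i + 1 + window]
    return Counter(context)


# Hypothetical example sentence containing both senses of "bass".
tokens = "he played the bass guitar while we grilled the sea bass".split()
first = tokens.index("bass")
print(local_features(tokens, first))
print(bag_features(tokens, first, window=3))
```

The local extractor keeps the relative position of each neighbor, while the bag extractor discards it; that difference is the one the passage is pointing at.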
In this study we will return to extremely local sources of evidence, and show that models of discrete syntactic relationships have considerable advantages.

*This research was supported by an NDSEG Fellowship and by DARPA grant N00014-90-J-1863. The author is also affiliated with the Linguistics Research Department of AT&T Bell Laboratories, and greatly appreciates the use of its resources in support of this work. He would also like to thank Eric Brill, Bill Gale, Libby Levison, Mitch Marcus and Philip Resnik for their valuable feedback.

2. DEFINITIONS OF SENSE

The traditional definition of word sense is "One of several meanings assigned to the same orthographic string". As meanings can always be partitioned into multiple refinements, senses are typically organized in a tree such as one finds in a dictionary. In the extreme case, one could continue making refinements until a word has a slightly different sense every time it is used. If so, the title of this paper is a tautology. However, the studies in this paper are focused on the sense distinctions at the top of the tree. As a working definition, the distinctions considered are those between meanings that are not typically translated to the same word in a foreign language.

Therefore, one natural type of sense distinction to consider is that found in English words which indeed have multiple translations in a language such as French. As is now standard in the field, we use the Canadian Hansards, a parallel bilingual corpus, to provide sense tags in the form of French translations. Unfortunately, the Hansards are highly skewed in their sense distributions, and it is difficult to find words for which there are adequate numbers of a second sense. More diverse large bilingual corpora are not yet readily available. We also use data sets which have been hand-tagged by native English speakers.
To make the selection of sense distinctions more objective, we use words such as bass where the sense distinctions (fish and musical instrument) correspond to pronunciation differences ([bæs] and [beɪs]). Such data is often problematic, as the tagging is potentially subjective and error-filled, and sufficient quantities are difficult to obtain.

As a solution to the data shortages for the above methods, [Gale, Church and Yarowsky 1992b] proposed the use of "pseudo-words," artificial sense ambiguities created by taking two English words with the same part of speech (such as guerrilla and reptile), and replacing each instance of both in a corpus with a new polysemous word guerrilla/reptile. As it is entirely possible that the concepts guerrilla and reptile are represented by the same orthographic string in some foreign language, choosing between these two meanings based on context is a problem a word sense disambiguation algorithm could easily face. "Pseudo-words" are very useful for developing and testing disambiguation methods because of their nearly unlimited availability and the known, fully reliable
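The pseudo-word construction described in this section can be sketched in a few lines. This is a hedged illustration of the idea rather than the procedure of [Gale, Church and Yarowsky 1992b]: the function name, the "/" separator in the merged token, and the toy corpus are assumptions.

```python
def make_pseudo_word_corpus(tokens, word_a, word_b):
    """Replace every instance of word_a or word_b with one merged pseudo-word.

    Returns the rewritten tokens and a parallel list of gold labels:
    the original word at each replaced position, None elsewhere.
    """
    pseudo = f"{word_a}/{word_b}"  # assumed naming scheme for the merged token
    new_tokens, labels = [], []
    for tok in tokens:
        if tok in (word_a, word_b):
            new_tokens.append(pseudo)
            labels.append(tok)  # the hidden "sense" is the word that was replaced
        else:
            new_tokens.append(tok)
            labels.append(None)
    return new_tokens, labels


# Hypothetical toy corpus for illustration only.
corpus = "the guerrilla hid in the hills while the reptile slept in the sun".split()
ambiguous, gold = make_pseudo_word_corpus(corpus, "guerrilla", "reptile")
print(ambiguous)
print([g for g in gold if g is not None])
```

Because the original word is recorded at every replaced position, the gold labels come for free, which is what makes such artificial ambiguities convenient for testing a disambiguation method.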


Similar resources

One Sense per Collocation and Genre/Topic Variations

This paper revisits the one sense per collocation hypothesis using fine-grained sense distinctions and two different corpora. We show that the hypothesis is weaker for fine-grained sense distinctions (70% vs. 99% reported earlier on 2-way ambiguities). We also show that one sense per collocation does hold across corpora, but that collocations vary from one corpus to the other, following genre a...


"One Entity per Discourse" and "One Entity per Collocation" Improve Named-Entity Disambiguation

The “one sense per discourse” (OSPD) and “one sense per collocation” (OSPC) hypotheses have been very influential in Word Sense Disambiguation. The goal of this paper is twofold: (i) to explore whether these hypotheses hold for entities, that is, whether several mentions in the same discourse (or the same collocation) tend to refer to the same entity or not, and (ii) test their impact in Named-...


Word Sense Induction: Triplet-Based Clustering and Automatic Evaluation

In this paper a novel solution to automatic and unsupervised word sense induction (WSI) is introduced. It represents an instantiation of the ‘one sense per collocation’ observation (Gale et al., 1992). Like most existing approaches it utilizes clustering of word co-occurrences. This approach differs from other approaches to WSI in that it enhances the effect of the one sense per collocation obs...


Disambiguating Noun Compounds

This paper is concerned with the interaction between word sense disambiguation and the interpretation of noun compounds (NCs) in English. We develop techniques for disambiguating word sense specifically in NCs, and then investigate whether word sense information can aid in the semantic relation interpretation of NCs. To disambiguate word sense, we combine the one sense per collocation heuristic...


The Sense Boundary Decision and the Sense Labeling from Collocation Clustering

This paper discusses how to decide practical sense boundaries for homonymous words. One of the serious problems in making dictionaries or thesauri is the vague boundary between senses. This also becomes a bottleneck in sense disambiguation for practical language processing systems. This paper proposes a method for discovering the sense boundaries of homonyms using collocation from large corpora and ...
