Estimating Lexical Priors for Low-Frequency Morphologically Ambiguous Forms

Authors

  • R. Harald Baayen
  • Richard Sproat
Abstract

Given a form that is previously unseen in a sufficiently large training corpus, and that is morphologically n-ways ambiguous (serves n different lexical functions), what is the best estimator for the lexical prior probabilities for the various functions of the form? We argue that the best estimator is provided by computing the relative frequencies of the various functions among the hapax legomena, the forms that occur exactly once in a corpus; in particular, a hapax-based estimator is better than one based on the proportion of the various functions among words of all frequency ranges. As we shall argue, this is because when one computes an overall measure, one is including high-frequency words, and high-frequency words tend to have idiosyncratic properties that are not at all representative of the much larger mass of (productively formed) low-frequency words. This result has potential importance for various kinds of applications requiring lexical disambiguation, including, in particular, stochastic taggers. This is especially true when some initial hand-tagging of a corpus is required: for predicting lexical priors for very low-frequency morphologically ambiguous types (most of which would not occur in any given corpus), one should concentrate on tagging a good representative sample of the hapax legomena, rather than extensively tagging words of all frequency ranges.
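
The two estimators contrasted in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function name `lexical_priors`, the placeholder forms `f1` to `f5`, the `noun`/`verb` labels, and the toy counts are all invented for illustration, and the overall estimator is assumed here to be token-based, which the abstract does not specify.

```python
from collections import Counter


def lexical_priors(tagged_tokens):
    """Return (hapax_priors, overall_priors) as dicts mapping function -> probability.

    tagged_tokens: iterable of (form, function) pairs from a hand-tagged corpus.
    """
    tagged_tokens = list(tagged_tokens)
    # How often each surface form occurs in the corpus.
    form_counts = Counter(form for form, _ in tagged_tokens)

    # Overall estimator (assumed token-based): function shares over all tokens.
    overall = Counter(func for _, func in tagged_tokens)

    # Hapax-based estimator: function shares over forms that occur exactly once.
    hapax = Counter(
        func for form, func in tagged_tokens if form_counts[form] == 1
    )

    def normalize(counts):
        total = sum(counts.values())
        return {k: v / total for k, v in counts.items()} if total else {}

    return normalize(hapax), normalize(overall)


# Hypothetical toy corpus: one high-frequency verb form and four hapaxes.
toy_tokens = [("f1", "verb")] * 6 + [
    ("f2", "noun"),
    ("f3", "noun"),
    ("f4", "noun"),
    ("f5", "verb"),
]
hapax_priors, overall_priors = lexical_priors(toy_tokens)
```

In this invented corpus the single high-frequency form pulls the overall estimate toward `verb` (0.7), while the hapax-based estimate favors `noun` (0.75). This is the kind of divergence the abstract attributes to idiosyncratic high-frequency words.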

Related resources

Estimating Lexical Priors for Low-Frequency Syncretic Forms

Given a previously unseen form that is morphologically n-ways ambiguous, what is the best estimator for the lexical prior probabilities for the various functions of the form? We argue that the best estimator is provided by computing the relative frequencies of the various functions among the hapax legomena, the forms that occur exactly once in a corpus. This result has important impli...

The Comprehension of Acoustically Reduced Morphologically Complex Words: the Roles of Deletion, Duration, and Frequency of Occurrence

This study addresses the roles of segment deletion, durational reduction, and frequency of use in the comprehension of morphologically complex words. We report two auditory lexical decision experiments with reduced and unreduced prefixed Dutch words. We found that segment deletions as such delayed comprehension. Simultaneously, however, longer durations of the different parts of the words app...

Frequency Effects of Regular Past Tense Forms in English on Native Speakers’ and Second Language Learners’ Accuracy Rate and Reaction Time

There is substantial debate over the mental representation of regular past tense forms in both first language (L1) and second language (L2) processing. Specifically, the controversy revolves around the nature of morphologically complex forms such as the past tense –ed in English and how morphological structures of such forms are represented in the mental lexicon. This study focuses on the resul...

Stem Homograph Inhibition and Stem Allomorphy: Representing and Processing Inflected Forms in a Multilevel Lexical System

Two lexical decision experiments were carried out in Spanish in order to address questions about the processing and representation of morphologically complex words in the mental lexicon. Responses to targets (e.g., mor-os “Moors”) were found to be reliably slower and less accurate when they were preceded by stem homograph primes (mor-ir “to die”) compared to unrelated control primes (sill-a “ch...

Interplay between morphology and frequency in lexical access: the case of the base frequency effect.

A major issue in lexical processing concerns storage and access of lexical items. Here we make use of the base frequency effect to examine this. Specifically, reaction time to morphologically complex words (words made up of base and suffix, e.g., agree+able) typically reflects frequency of the base element (i.e., total frequency of all words in which agree appears) rather than surface word freq...

Journal:
  • Computational Linguistics

Volume 22

Publication date: 1996