Part-of-Speech Tagging from "Small" Data Sets

Authors

  • Eric Neufeld
  • Greg Adams
Abstract

Probabilistic approaches to part-of-speech (POS) tagging compile statistics from massive corpora such as the Lancaster-Oslo-Bergen (LOB) corpus. Training on a 900,000-token training corpus, the hidden Markov model (HMM) method easily achieves a 95 per cent success rate on a 100,000-token test corpus. However, even such large corpora contain relatively few distinct words, and new words are subsequently encountered in test corpora. For example, the million-token LOB contains only about 45,000 different words, most of which occur only once or twice. We find that 3-4 per cent of tokens in a disjoint test corpus are unseen, that is, unknown to the tagger after training, and cause a significant proportion of errors. A corpus representative of all possible tag sequences seems implausible enough, let alone a corpus that also represents, even in small numbers, enough of English to make the problem of unseen words insignificant. Experimental results confirm that this extreme course is not necessary. Variations on the HMM approach, including ending-based approaches, incremental learning strategies, and the use of approximate distributions, result in a tagger that tags unseen words nearly as accurately as seen words.

Although probabilistic approaches to linguistic problems were attempted early this century [Zipf, 1932], early work was hampered by real difficulties of collecting and managing statistics, not to mention challenges to probabilistic methods in principle. Fast, inexpensive computer technology and the availability of tagged electronic corpora such as the million-word Lancaster-Oslo-Bergen (LOB) Corpus [Johansson, 1980; Johansson et al., 1986] changed this situation, and probabilistic approaches to a variety of natural language processing problems have been popular for some time. A striking success of the probabilistic approach has been the use of hidden Markov models (HMMs) to attach part-of-speech (POS) tags to unrestricted text, for example, [Kupiec, 1992].

Given an actual stream of text, a sequence of tokens (instances of words) $w_1 \dots w_n$, the HMM method computes the word-tag sequence (or simply tag sequence) $t_1 \dots t_n$ that most probably generated the sequence. That is, the HMM method finds the tag sequence that maximizes the probability of $t_1 \dots t_n$ given $w_1 \dots w_n$ (Equation 42.1). Such probabilities are impossible to collect in practice; furthermore, their number is exponential in $n$. However, assuming that 1) the probability of a tag $t_i$ directly depends only on the tag immediately preceding it, and that 2) the probability of any word $w_i$ depends only upon the tag $t_i$ that produced it, the …
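
Under the two independence assumptions stated above, the score of a candidate tag sequence factors into a product of tag-to-tag transition probabilities and tag-to-word emission probabilities, and the best sequence can be found with the standard Viterbi dynamic program. The following is a minimal sketch of that idea, not the authors' implementation: the function name `viterbi`, the three-tag toy tagset, every probability in the tables, and the `floor` value used for unseen events are all illustrative assumptions.

```python
import math

def viterbi(tokens, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Return the most probable tag sequence under a bigram HMM.

    `floor` is an assumed stand-in probability for events missing from
    the tables (for example, unseen words); it is not from the paper.
    """
    if not tokens:
        return []

    def lp(table, key):
        # Work in log space to avoid underflow on long sequences.
        return math.log(table.get(key, floor))

    # best[i][t] = (log-prob of best path ending in tag t at position i, previous tag)
    best = [{t: (lp(start_p, t) + lp(emit_p, (t, tokens[0])), None) for t in tags}]
    for i in range(1, len(tokens)):
        column = {}
        for t in tags:
            # Assumption 1: tag t depends only on the preceding tag p.
            score, prev = max((best[i - 1][p][0] + lp(trans_p, (p, t)), p) for p in tags)
            # Assumption 2: the word depends only on the tag that produced it.
            column[t] = (score + lp(emit_p, (t, tokens[i])), prev)
        best.append(column)

    # Backtrack from the best final tag.
    tag = max(tags, key=lambda t: best[-1][t][0])
    path = [tag]
    for i in range(len(tokens) - 1, 0, -1):
        tag = best[i][tag][1]
        path.append(tag)
    return path[::-1]

# Toy model: all numbers below are made up for illustration.
TAGS = ["DET", "NOUN", "VERB"]
START_P = {"DET": 0.6, "NOUN": 0.3, "VERB": 0.1}
TRANS_P = {
    ("DET", "DET"): 0.05, ("DET", "NOUN"): 0.9, ("DET", "VERB"): 0.05,
    ("NOUN", "DET"): 0.1, ("NOUN", "NOUN"): 0.3, ("NOUN", "VERB"): 0.6,
    ("VERB", "DET"): 0.5, ("VERB", "NOUN"): 0.4, ("VERB", "VERB"): 0.1,
}
EMIT_P = {
    ("DET", "the"): 0.7,
    ("NOUN", "dog"): 0.4, ("NOUN", "barks"): 0.05,
    ("VERB", "barks"): 0.3,
}

print(viterbi(["the", "dog", "barks"], TAGS, START_P, TRANS_P, EMIT_P))
```

The table fills in time proportional to the number of tokens times the square of the number of tags, rather than enumerating an exponential number of tag sequences. A constant floor is the crudest possible treatment of unseen words; the ending-based and approximate-distribution variants mentioned in the abstract address that case more carefully.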


Similar resources

Part-of-Speech Tagging of Persian Using a Fuzzy Network Model

Part of speech tagging (POS tagging) is an ongoing research in natural language processing (NLP) applications. The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS-tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...


An improved joint model: POS tagging and dependency parsing

Dependency parsing is a way of syntactic parsing and a natural language that automatically analyzes the dependency structure of sentences, and the input for each sentence creates a dependency graph. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers do the POS tagging task along with dependency parsing in a pipeline mode. Unfortunately, in pipel...


A Part-of-Speech Tagging System for the Persian Language

Abstract: Part-Of-Speech (POS) tagging is essential work for many models and methods in other areas in natural language processing such as machine translation, spell checker, text-to-speech, automatic speech recognition, etc. So far, high accurate POS taggers have been created in many languages. In this paper, we focus on POS tagging in the Persian language. Because of problems in Persian POS t...


Part-of-Speech Tagging for Twitter: Annotation, Features, and Experiments

We address the problem of part-of-speech tagging for English data from the popular microblogging service Twitter. We develop a tagset, annotate data, develop features, and report tagging results nearing 90% accuracy. The data and tools have been made available to the research community with the goal of enabling richer text analysis of Twitter and related social media data sets.


Joint Prediction of Morphosyntactic Categories for Fine-Grained Arabic Part-of-Speech Tagging Exploiting Tag Dictionary Information

Part-of-speech (POS) tagging for morphologically rich languages such as Arabic is a challenging problem because of their enormous tag sets. One reason for this is that in the tagging scheme for such languages, a complete POS tag is formed by combining tags from multiple tag sets defined for each morphosyntactic category. Previous approaches in Arabic POS tagging applied one model for each morph...


Part of Speech Tagging Using Statistical Approach for Nepali Text

Abstract—Part of Speech Tagging has always been a challenging task in the era of Natural Language Processing. This article presents POS tagging for Nepali text using Hidden Markov Model and Viterbi algorithm. From the Nepali text, annotated corpus training and testing data set are randomly separated. Both methods are employed on the data sets. Viterbi algorithm is found to be computationally fa...




Publication date: 1995