Korean Part-of-speech Tagging Based on a Hidden Markov Model
Abstract
In this paper, we describe a method for assigning a part-of-speech tag to each morpheme in Korean text. The method is based on a hidden Markov model that can be trained without any tagged corpus. To reduce the computation needed to process multiple observation sequences, which arise peculiarly in Korean part-of-speech tagging, we develop a revised Viterbi algorithm for determining the most promising tag sequence. Experimental results show that the accuracy of the model is approximately 90% on average. This performance is no better than that of English tagging systems, which is due to the partially free word order of Korean and the lack of training data. However, to the best of our knowledge, this is the first Korean POS tagging system that can be trained without a tagged corpus.
Many words have ambiguous part-of-speech (POS) tags. For example, 'time' in English can be a noun, a verb, or an adjective. In many cases, such ambiguities can be resolved using contextual information. For example, in "Time is money.", the word 'time' can be determined to be a noun. The assignment of the correct POS tag to each word in a text is called part-of-speech tagging. A part-of-speech tagger is a system that selects the most appropriate POS tag for each word using contextual information.
POS tagging has been approached in several ways: rule-based approaches, statistical approaches, neural network approaches, and so on. We treat the POS tagging problem with a statistical method described in terms of a Markov model. Hidden Markov modeling permits us to compute the most probable sequence of state transitions, which corresponds to the most likely sequence of POS tags for a given sentence.
POS tagging in Korean differs in several respects from that in English. In Korean, most word phrases consist of more than one morpheme. In order to assign POS tags to the morphemes in each word phrase, we must morphologically analyze the word phrase in advance. One word phrase, however, may be analyzed in several different ways due to lexical ambiguities. Furthermore, each analysis may consist of a different number of morphemes. POS tagging in Korean can be done in units of word phrases [12]. However, word phrase-based tagging has some problems, including the following: because all possible tags for word phrases cannot be predicted in advance, whenever a new tag for a word phrase is …
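The abstract describes choosing the most probable tag sequence under a hidden Markov model. As a rough illustration, the sketch below implements the standard first-order Viterbi algorithm, not the revised version the paper develops for multiple observation sequences. The function name `viterbi`, the smoothing floor, the two-tag set, and all probability values are hypothetical and chosen only so that the "Time is money." example resolves; none of them come from the paper.

```python
import math

def viterbi(words, tags, start_p, trans_p, emit_p, floor=1e-12):
    """Most probable tag sequence for `words` under a first-order HMM.

    `floor` is an assumed smoothing value for unseen events, not from the paper.
    """
    if not words:
        return []

    def lp(p):
        # Work in log space; clamp zero probabilities to the floor.
        return math.log(max(p, floor))

    # best[i][t]: log-probability of the best path that ends in tag t at position i
    best = [{t: lp(start_p.get(t, 0.0)) + lp(emit_p[t].get(words[0], 0.0))
             for t in tags}]
    back = [{}]  # back[i][t]: predecessor tag on that best path

    for i in range(1, len(words)):
        best.append({})
        back.append({})
        for t in tags:
            prev, score = max(
                ((p, best[i - 1][p] + lp(trans_p[p].get(t, 0.0))) for p in tags),
                key=lambda pair: pair[1],
            )
            best[i][t] = score + lp(emit_p[t].get(words[i], 0.0))
            back[i][t] = prev

    # Trace back from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t])]
    for i in range(len(words) - 1, 0, -1):
        path.append(back[i][path[-1]])
    return path[::-1]

# Toy model with made-up probabilities (illustrative only).
TAGS = ["NOUN", "VERB"]
START_P = {"NOUN": 0.7, "VERB": 0.3}
TRANS_P = {"NOUN": {"NOUN": 0.3, "VERB": 0.7},
           "VERB": {"NOUN": 0.8, "VERB": 0.2}}
EMIT_P = {"NOUN": {"time": 0.5, "money": 0.5},
          "VERB": {"time": 0.1, "is": 0.9}}

tagged = viterbi(["time", "is", "money"], TAGS, START_P, TRANS_P, EMIT_P)
print(tagged)
```

With these toy values, 'time' is tagged as a noun even though the model also allows it as a verb, because the noun reading gives the higher-probability path through the whole sentence. For Korean, the observation would be a lattice of alternative morpheme sequences rather than a single word list, which is the case the paper's revised algorithm is designed to handle.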
Similar resources
Hidden Markov Model-Based Korean Part-of-Speech Tagging Considering High Agglutinativity, Word-Spacing, and Lexical Correlativity
Persian Part-of-Speech Tagging Using a Fuzzy Network Model
Part-of-speech tagging (POS tagging) is an ongoing research area in natural language processing (NLP) applications. The process of classifying words into their parts of speech and labeling them accordingly is known as part-of-speech tagging, POS tagging, or simply tagging. Parts of speech are also known as word classes or lexical categories. The purpose of POS tagging is determining the grammatical ...
TAKTAG: Two-phase learning method for hybrid statistical/rule-based part-of-speech disambiguation
Both statistical and rule-based approaches to part-of-speech (POS) disambiguation have their own advantages and limitations. Especially for Korean, the narrow windows provided by a hidden Markov model (HMM) cannot cover the lexical and long-distance dependencies necessary for POS disambiguation. On the other hand, the rule-based approaches are not accurate and flexible to new tag-sets and language...
Automatic Word Spacing Using Hidden Markov Model for Refining Korean Text Corpora
This paper proposes a word spacing model using a hidden Markov model (HMM) for refining Korean raw text corpora. Previous statistical approaches to automatic word spacing have used models that rely on inaccurate probabilities because they do not consider the previous spacing state. We consider the word spacing problem as a classification problem such as part-of-speech (POS) tagging and have expe...
Speech enhancement based on hidden Markov model using sparse code shrinkage
This paper presents a new hidden Markov model-based (HMM-based) speech enhancement framework based on independent component analysis (ICA). We propose analytical procedures for training clean speech and noise models by the Baum re-estimation algorithm and present a maximum a posteriori (MAP) estimator based on a Laplace-Gaussian (for clean speech and noise, respectively) combination in the HMM ...
Publication date: 1997