Corpus assisted development of a Hungarian morphological analyser and guesser
Abstract
1 Introduction

Computational processing of highly inflectional languages, which typically feature a huge number of possible word forms, relies on efficient morphological analysis. This requires a comprehensive morphological analyser. A simple lexicon lookup cannot replace it, since such a lexicon would have to contain every word form of the language and would be computationally intractable. Instead, an analyser should work in tandem with a base form lexicon. The analyser should be able to handle all inflectional, productive derivational and compounding phenomena, and it should also be capable of base form reduction.

Morphological processing of huge corpora inevitably encounters a large number of word forms whose base form is not listed in the analyser's lexicon and which therefore cannot be analysed. To cope with such unknown words in the corpus, a combined method can be applied that uses both symbolic constraints and statistical information. This paper describes and empirically investigates how this method can be put into practice and used to improve the output of the morphological analysis.

In section 2 we give a brief description of the analyser tool as it was originally developed. Section 3 discusses the symbolic guesser module, while section 4 describes the data used in the experiments. In section 5 we present the experiments carried out under different settings and show how overgeneration by the guesser module can be tamed using word form and suffix statistics gathered from a huge corpus. Conclusions and suggestions for further work end the paper in section 6.

2 The morphological analyser

Although morphological analysis is the basis for many NLP applications, especially for highly inflectional languages, morphological analysers as separate NLP tools have received limited attention in the literature, most of them being commercial products.
HuMOR ('High speed Unification MORphology'), the morphological analyser used for tagging Hungarian corpora, was likewise developed by a Hungarian language technology company, MorphoLogic (Prószéky and Kis, 1999). It performs a classical 'item-and-arrangement' (IA) style analysis (Hockett, 1954). The input word is analysed as a sequence of morphs, each having (i) a surface form (which appears as part of the input string), (ii) a lexical form (i.e. the 'quotation form' of the morpheme) and (iii) a category label (which may contain some structured information or simply be an unstructured label). …
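The morph-sequence representation described above can be illustrated with a minimal Python sketch. It is not HuMOR's actual data format or API: the names `Morph`, `surface_string`, `is_analysis_of` and `lexical_gloss`, the category labels `N`, `PL` and `INE`, and the segmentation of the example word házakban ('in houses') are all assumptions made for illustration only.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Morph:
    """One morph of an item-and-arrangement style analysis (hypothetical type)."""
    surface: str   # (i) the substring of the input word
    lexical: str   # (ii) the 'quotation form' of the morpheme
    category: str  # (iii) a category label (illustrative, not HuMOR's tagset)


def surface_string(morphs: List[Morph]) -> str:
    """Concatenate the surface forms of the morphs in order."""
    return "".join(m.surface for m in morphs)


def is_analysis_of(word: str, morphs: List[Morph]) -> bool:
    """An analysis must account for the whole input string, morph by morph."""
    return surface_string(morphs) == word


def lexical_gloss(morphs: List[Morph]) -> str:
    """Render the analysis as lexical forms with their category labels."""
    return "+".join(f"{m.lexical}[{m.category}]" for m in morphs)


# Assumed segmentation: stem 'ház', plural with linking vowel, inessive case.
example = [
    Morph(surface="ház", lexical="ház", category="N"),
    Morph(surface="ak", lexical="k", category="PL"),
    Morph(surface="ban", lexical="ban", category="INE"),
]

print(is_analysis_of("házakban", example))
print(lexical_gloss(example))
```

Keeping the surface and lexical forms separate is what lets a single entry for a morpheme cover several surface variants, such as a plural marker that appears with different linking vowels.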
Related resources
Combining Symbolic and Statistical Methods in Morphological Analysis and Unknown Word Guessing
Highly inflectional/agglutinative languages like Hungarian typically feature possible word forms in such a magnitude that automatic methods that provide morphosyntactic annotation on the basis of some training corpus often face the problem of data sparseness. A possible solution to this problem is to apply a comprehensive morphological analyser, which is able to analyse almost all wordforms all...
Handling Unknown Words in Arabic FST Morphology
A morphological analyser only recognizes words that it already knows in the lexical database. It needs, however, a way of sensing significant changes in the language in the form of newly borrowed or coined words with high frequency. We develop a finite-state morphological guesser in a pipelined methodology for extracting unknown words, lemmatizing them, and giving them a priority weight for inc...
Using local rules for disambiguation of homographs in Hungarian corpora
The historical corpus of Hungarian contains about 20 million running words at the moment. To be able to retrieve the occurrences of the lexemes, a morphological analyser programme was developed which is able to segment the running words and identify the lexeme and the suffixes. Over 30% of the running words can have more than one correct analysis. Therefore we are aiming to develop methods fo...
Polish Morphological Guesser Based on a Statistical A Tergo Index
We present a direct method of construction of a morphosyntactic guesser for Polish, which is a program producing morphosyntactic descriptions for word forms unknown to the morphological analyser. The core of the method is the construction of a statistical a tergo index, in which pseudo-suffixes (endings) extracted by a statistical tree define morpho-syntactic properties of corresponding word fo...
Comparing a statistical and a constraint-based method
In this paper we compare two competing approaches to part-of-speech tagging, statistical and constraint-based disambiguation, using French as our test language. We imposed a time limit on our experiment: the amount of time spent on the design of our constraint system was about the same as the time we used to train and test the easy-to-implement statistical model. We describe the two systems an...
Publication date: 2003