Division of Spanish Words into Morphemes with a Genetic Algorithm

Authors

  • Alexander F. Gelbukh
  • Grigori Sidorov
  • Diego Lara-Reyes
  • Liliana Chanona-Hernández
Abstract

We discuss an unsupervised technique for determining the morpheme structure of words in an inflectional language, with Spanish as a case study. We use global optimization, implemented with a genetic algorithm, whereas most previous work relies on heuristics computed from conditional probabilities of word parts. We thus search the complete space of solutions rather than pruning it in advance at the risk of eliminating correct solutions. We also work at the derivational level, in contrast with the more traditional grammatical level, which considers only inflections. The algorithm works as follows. The input is a wordlist built from a large dictionary or corpus of the language, and the output is the same wordlist with each word divided into morphemes. First, we build a redundant list of all strings that could be prefixes, suffixes, or stems of the words in the wordlist. Then, we detect possible paradigms in this set and remove from the lists of candidate prefixes and suffixes (but not stems) every item that does not participate in such a paradigm. Finally, the genetic algorithm chooses a subset of these lists of candidate prefixes, stems, and suffixes. The fitness function is based on the minimum description length principle: we choose the minimum number of elements necessary to cover all the words. The resulting subset is used to divide the words in the wordlist. We present the algorithm parameters and a preliminary evaluation of the experimental results for a Spanish dictionary.
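The steps named in the abstract (redundant candidate lists, paradigm filtering of affixes, genetic selection of a minimal covering subset) can be illustrated with a small Python sketch. It is not the authors' implementation. Everything specific in it is an assumption made for illustration: the function names, the `max_affix` length limit, the paradigm test (an affix is kept if some base takes it and at least one other ending), the bit-string encoding, the penalty for uncovered words, and all genetic algorithm parameters, which the abstract does not state.

```python
import random

def candidates(words, max_affix=3):
    """Redundant sets of every string that might be a prefix, stem, or suffix."""
    pre, stem, suf = set(), set(), set()
    for w in words:
        for i in range(min(max_affix, len(w) - 1) + 1):                  # prefix w[:i]
            for j in range(max(i + 1, len(w) - max_affix), len(w) + 1):  # suffix w[j:]
                pre.add(w[:i]); stem.add(w[i:j]); suf.add(w[j:])
    pre.discard(""); suf.discard("")  # the empty affix is always allowed, at no cost
    return pre, stem, suf

def paradigm_filter(suffixes, words):
    """Keep a suffix only if some base takes it and also another ending (assumed test)."""
    wordset, keep = set(words), set()
    for x in suffixes:
        for w in wordset:
            if x and w.endswith(x) and len(w) > len(x):
                base = w[:-len(x)]
                if any(base + y in wordset for y in suffixes | {""} if y != x):
                    keep.add(x)
                    break
    return keep

def build_pool(words, max_affix=3):
    """Ordered list of (kind, string) candidates; stems are never filtered."""
    pre, stem, suf = candidates(words, max_affix)
    suf = paradigm_filter(suf, words)
    # Prefixes are suffixes of the reversed words, so the same filter is reused.
    pre = {p[::-1] for p in paradigm_filter({p[::-1] for p in pre},
                                            [w[::-1] for w in words])}
    return ([("pre", p) for p in sorted(pre)] + [("stem", s) for s in sorted(stem)]
            + [("suf", x) for x in sorted(suf)])

def chosen_sets(bits, pool):
    """Decode a bit-string chromosome into (prefixes, stems, suffixes)."""
    return tuple({s for (kind, s), b in zip(pool, bits) if b and kind == k}
                 for k in ("pre", "stem", "suf"))

def segment(word, pre, stem, suf):
    """First prefix+stem+suffix split of `word` over the chosen sets, else None."""
    for i in range(len(word)):
        for j in range(i + 1, len(word) + 1):
            p, s, x = word[:i], word[i:j], word[j:]
            if (not p or p in pre) and s in stem and (not x or x in suf):
                return p, s, x
    return None

def cost(bits, pool, words):
    """Number of chosen elements, plus an assumed penalty per uncovered word."""
    sets = chosen_sets(bits, pool)
    uncovered = sum(segment(w, *sets) is None for w in words)
    return sum(bits) + uncovered * (len(pool) + 1)

def evolve(pool, words, pop_size=40, generations=60, p_mut=0.02, seed=0):
    """Plain generational GA; all parameter values here are hypothetical."""
    rng = random.Random(seed)
    n = len(pool)
    fit = lambda c: cost(c, pool, words)
    # The all-ones chromosome covers every word, because each whole word is a stem.
    pop = [[1] * n] + [[rng.randint(0, 1) for _ in range(n)]
                       for _ in range(pop_size - 1)]
    for _ in range(generations):
        pop.sort(key=fit)
        nxt = pop[:2]                                  # elitism: keep the two best
        while len(nxt) < pop_size:
            a = min(rng.sample(pop, 3), key=fit)       # tournament selection
            b = min(rng.sample(pop, 3), key=fit)
            cut = rng.randrange(1, n)                  # one-point crossover
            nxt.append([bit ^ (rng.random() < p_mut)   # bit-flip mutation
                        for bit in a[:cut] + b[cut:]])
        pop = nxt
    return min(pop, key=fit)

if __name__ == "__main__":
    words = ["canto", "cantas", "canta", "hablo", "hablas", "habla"]
    pool = build_pool(words)
    best = evolve(pool, words)
    for w in words:
        print(w, segment(w, *chosen_sets(best, pool)))
```

The penalty of `len(pool) + 1` per uncovered word makes any incomplete cover cost more than selecting every candidate, so with elitism the returned chromosome always covers the whole wordlist. How close it comes to the smallest cover depends on the seed and on the assumed parameters.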

Similar resources

Collaboration space division in collaborative product development based on a genetic algorithm

The advance in the global environment, rapidly changing markets, and information technology has created a new stage for design. In such an environment, one strategy for success is the Collaborative Product Development (CPD). Organizing people effectively is the goal of Collaborative Product Development, and it solves the problem with certain foreseeability. The development group activities are ...

Unsupervised Morphemes Segmentation

In this work, we describe the algorithm adopted to split words into the smallest possible meaningful units, or morphemes. The algorithm is unsupervised and not dependent on any language. The model is developed using the English language. However, the linguistic rules specific to the English language are not implemented. The algorithm focuses on the identification of smallest units of words based on thei...

Morphemes and POS tags for n-gram based evaluation metrics

We propose the use of morphemes for automatic evaluation of machine translation output, and systematically investigate a set of F score and BLEU score based metrics calculated on words, morphemes and POS tags along with all corresponding combinations. Correlations between the new metrics and human judgments are calculated on the data of the third, fourth and fifth shared tasks of the Statistica...

Morphemes as Necessary Concept for Structures Discovery from Untagged Corpora

This paper describes an overview of a method which allows discovery of syntactic structures from untagged corpora. It is composed of three main steps: first, the discovery of the grammatical morphemes of the language; then the construction of the chunks, which are a multilingual conceptual level allowing the bypass of the limping notion of words; and finally the discovery of the relations between chunk...

MORPHEMIA: a semi-supervised algorithm for the segmentation of modern Greek words into morphemes

The present paper reports on MORPHEMIA, a semi-supervised machine-learning algorithm designed to segment Modern Greek (MG) words into morphemes. The algorithm segments its input iteratively. During its first iteration, the algorithm uses its a priori linguistic knowledge. At the end of each successful iteration, the algorithm extracts new morphological knowledge which is utilised during its nex...

Publication date: 2008