Scaling an Irish FST Morphology Engine for Use on Unrestricted Text

Authors

  • Elaine Uí Dhonnchadha
  • Josef van Genabith
Abstract

This paper details the steps involved in scaling up a lexicalised finite-state morphology transducer for use on unrestricted text. Our starting point was a baseline inflectional morphology engine [1], with 81% token coverage measured against a 15-million-word corpus of Irish texts [2]. Manually scaling the FST lexicon component of a morphology transducer is time-consuming, expensive and rarely, if ever, complete. To scale up the engine we used a combination of strategies: semi-automatic population of the finite-state lexicon from machine-readable dictionary resources and from printed resources using optical character recognition, the addition of derivational morphology, and the development of morphological guessers. This paper details the coverage increase contributed by each step. The full system achieves token coverage of 93%, which is extended to 100% through the use of morphological guessers.
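As an illustration of the token coverage figures quoted in the abstract, the following is a minimal sketch of how such a measure could be computed, assuming coverage means the share of corpus tokens that receive at least one analysis. It is not the authors' implementation: the toy lexicon, the tag names (`+Art`, `+Noun`, `+Adj`, `+Guess`) and the functions `lexicon_analyse` and `guesser_analyse` are hypothetical stand-ins for the finite-state transducer and the morphological guessers.

```python
from collections import Counter

def token_coverage(tokens, analyse):
    """Return the fraction of tokens for which analyse() yields an analysis."""
    if not tokens:
        return 0.0
    counts = Counter(tokens)
    # Count every occurrence of each token type that gets an analysis.
    covered = sum(n for tok, n in counts.items() if analyse(tok))
    return covered / len(tokens)

# Hypothetical toy lexicon standing in for the FST lexicon component.
LEXICON = {
    "an": ["an+Art"],
    "fear": ["fear+Noun"],
    "mór": ["mór+Adj"],
}

def lexicon_analyse(token):
    """Lexicon-only lookup: unknown tokens get no analysis."""
    return LEXICON.get(token.lower(), [])

def guesser_analyse(token):
    """Lexicon lookup with a fallback guess for unknown tokens (assumed behaviour)."""
    return lexicon_analyse(token) or [token + "+Guess"]

tokens = ["an", "fear", "mór", "agus", "an", "bhean", "an", "fear"]
print(token_coverage(tokens, lexicon_analyse))  # 6 of 8 tokens are covered
print(token_coverage(tokens, guesser_analyse))  # the fallback covers every token
```

Because the measure is token-based rather than type-based, frequent words count once per occurrence. Adding a fallback guesser raises coverage to 100% by construction, which says nothing about whether the guessed analyses are correct.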

Similar resources

A Part-of-speech tagger for Irish using Finite-State Morphology and Constraint Grammar Disambiguation

This paper describes the methodology used to develop a part-of-speech tagger for Irish, which is used to annotate a corpus of 30 million words of text with part-of-speech tags and lemmas. The tagger is evaluated using a manually disambiguated test corpus and it currently achieves 95% accuracy on unrestricted text. To our knowledge, this is the first part-of-speech tagger for Irish.

Partial Dependency Parsing for Irish

In this paper we present a partial dependency parser for Irish, in which Constraint Grammar (CG) rules are used to annotate dependency relations and grammatical functions in unrestricted Irish text. Chunking is performed using a regular-expression grammar which operates on the dependency tagged sentences. As this is the first implementation of a parser for unrestricted Irish text (to our knowle...

Visualizing the Evaluation of Distance Measures

This paper describes the development and use of an interface for visually evaluating distance measures. The combination of multidimensional scaling plots, histograms and tables allows for different stages of overview and detail. The interdisciplinary project Rule-based search in text databases with nonstandard orthography develops a fuzzy full text search engine and uses distance measures for h...

A Two-level Morphological Analyser and Generator for Irish using Finite-State Transducers

Computational morphology is an important part of natural language processing. Finite-state techniques have been applied successfully in computational phonology and morphology to many of the world’s major languages. Celtic languages such as Modern Irish present challenging morphological features that to date have not been addressed using finite-state technology. This paper presents a finite-stat...

A Comparing between the impacts of text based indexing and folksonomy on ranking of images search via Google search engine

Background and Aim: The purpose of this study was to compare the impact of text based indexing and folksonomy in image retrieval via Google search engine. Methods: This study used experimental method. The sample is 30 images extracted from the book “Gray anatomy”. The research was carried out in 4 stages; in the first stage, images were uploaded to an “Instagram” account so the images are tagge...

Publication date: 2005