Syntactically annotated corpora of Estonian

Author

  • Heli Uibo
Abstract

Syntactically annotated corpora are needed (1) to train and test parsers and various language technology products (grammar checkers, information retrieval and extraction systems, machine translation systems, etc.) and (2) to check how well existing linguistic theories agree with real language usage. Corpora can be annotated at different levels of depth. In shallow syntactically annotated corpora, a syntactic function is determined for every word form; in deep syntactically annotated corpora (treebanks), the dependency structure of every sentence is also determined (graphically represented as a tree). A Constraint Grammar shallow syntactic parser for Estonian exists, developed by K. Müürisep and T. Puolakainen. To train the parser, we annotated texts of written Estonian (20 000 words of fiction, 6 000 words of legal text and 10 000 words of newspaper text). We have since extended the corpus to 200 000 words. We have also started to build two versions of an Estonian treebank. A parallel corpus of 50 sentences from J. Gaarder's novel "Sophie's World" has been annotated and aligned, and a hybrid Constraint Grammar plus phrase structure treebank is being developed; it currently consists of 2400 automatically generated trees, 149 of which have been manually revised.
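The difference between the two annotation depths described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Token` class, the `is_tree` helper, the example sentence "Poiss loeb raamatut" ("The boy reads a book") and the labels `@SUBJ`, `@FMV` and `@OBJ` are not taken from the corpus or from the parser's actual output format. The shallow layer stores one syntactic function per word form; the deep layer additionally stores the index of each token's head, so that every sentence forms a tree.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Token:
    """One word form with its annotation (hypothetical layout)."""
    form: str
    function: str               # shallow layer: syntactic function label
    head: Optional[int] = None  # deep layer: 1-based index of the head, 0 = root


def is_tree(tokens: List[Token]) -> bool:
    """Return True if the head links form a single rooted tree."""
    n = len(tokens)
    roots = [i for i, t in enumerate(tokens, start=1) if t.head == 0]
    if len(roots) != 1:
        return False
    for start in range(1, n + 1):
        seen = set()
        node = start
        # Follow head links upward; every token must reach the root.
        while node != 0:
            if node in seen or not (1 <= node <= n):
                return False  # cycle or head index out of range
            seen.add(node)
            head = tokens[node - 1].head
            if head is None:
                return False  # token has no deep annotation
            node = head
    return True


# Shallow annotation: a function label for every word form, no structure.
shallow = [
    Token("Poiss", "@SUBJ"),
    Token("loeb", "@FMV"),
    Token("raamatut", "@OBJ"),
]

# Deep annotation: the same labels plus a head for every token.
deep = [
    Token("Poiss", "@SUBJ", head=2),
    Token("loeb", "@FMV", head=0),
    Token("raamatut", "@OBJ", head=2),
]

print(is_tree(shallow), is_tree(deep))
```

In this sketch a shallow corpus can be upgraded to a treebank by filling in the `head` field, which loosely mirrors the idea of deriving trees automatically from shallow annotation and then revising them by hand.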

Related resources

Arborest – a VISL-Style Treebank Derived from an Estonian Constraint Grammar Corpus

Treebank creation is a very labor-consuming task, especially if the applications intended include machine learning, gold standard parser evaluation or teaching, since only a manually checked syntactically annotated corpus can provide optimal support for these purposes. There are, however, possibilities to make the annotation process (partly) automatic, saving (manual) annotation time and/or all...

Arborest – a Growing Treebank of Estonian

Treebank creation is a very labor-consuming task, especially if the applications intended include machine learning, gold standard parser evaluation or teaching, since only a manually checked syntactically annotated corpus can provide optimal support for these purposes. There are, however, possibilities to make the annotation process (partly) automatic, saving (manual) annotation time and/or all...

Morphologically and Syntactically Annotated Corpora of Many Languages

Annotated corpora have become a standard resource for research in both linguistics and computational processing of natural languages. Lexicographers judge word usage and distribution by occurrences in corpora; part-of-speech tags may help them narrow their queries. Grammarians may use syntactically annotated corpora (treebanks) for queries such as “show me all examples where a verb governs two ...

Finite Structure Query: A Tool for Querying Syntactically Annotated Corpora

Finite structure query (fsq for short) is a tool for querying syntactically annotated corpora. fsq employs a query language of high expressive power, namely full first order logic. It can be used to query arbitrary finite structures, not just trees.

Automatic Extraction of Verb Phrases from Annotated Corpora: A Linguistic Evaluation for Estonian

Statistically-based phrase extractors are fundamental tools for the improvement of Natural Language Processing applications designed for the new languages of the emerging countries. In this context, we will present a new architecture called SENTA (Software for the Extraction of N-ary Textual Associations) that identifies verbal phrases from lemmatized corpora. In particular, SENTA proposes a so...

Publication date: 2004