MULTEXT: Multilingual Text Tools and Corpora

Authors

  • Nancy Ide
  • Jean Véronis
Abstract

MULTEXT (Multilingual Text Tools and Corpora) is the largest project funded under the Commission of the European Communities' Linguistic Research and Engineering Program. The project will contribute to the development of generally usable software tools for manipulating and analysing text corpora, and to the creation of multilingual text corpora with structural and linguistic markup. It will attempt to establish conventions for encoding such corpora, building on and contributing to the preliminary recommendations of the relevant international and European standardization initiatives. MULTEXT will also work towards a set of guidelines for text software development, which will be widely published to enable future development by others. All tools and data developed within the project will be made freely and publicly available.

Similar resources

East meets West: Producing Multilingual Resources in a European Context

The EU concerted action TELRI has released a two-volume CD-ROM, which contains multilingual language resources, namely corpora, lexica, and tools for language engineering. This CD-ROM provides harmonised resources for unprecedented numbers and kinds of languages, mainly from non-EU countries, for which such resources still tend to be scarce. The first volume of the CD-ROM includes the aligned t...

MULTEXT-East Version 4: Multilingual Morphosyntactic Specifications, Lexicons and Corpora

The paper presents the fourth, “Mondilex” edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development, focused on the morphosyntactic level of linguistic description. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specif...

MULTEXT-East Version 3: Multilingual Morphosyntactic Specifications, Lexicons and Corpora

The paper presents the third edition of the MULTEXT-East language resources, a multilingual dataset for language engineering research and development. This standardised and linked set of resources covers a large number of mainly Central and Eastern European languages and includes the EAGLES-based morphosyntactic specifications, defining the features that describe word-level syntactic annotation...

Гармонизация Систем Помет Для Многоязычных Корпусов Посредством Решетки Понятий / Harmonizing Tagsets for Multilingual Corpora via Concept Lattice

Multilingual corpora can be annotated with morphosyntactic tags by monolingual tools. However, each of the tools is typically bundled with a specific tagset. This variety of tagging schemes may be a problem for the user: InterCorp, a parallel corpus, currently offers on-line concordances in 22 languages, 11 of them tagged with 11 different tagsets. Fig. 1 illustrates the tagset variety using c...

The MULTEXT-East Morphosyntactic Specifications for Slavic Languages

Word-level morphosyntactic descriptions, such as “Ncmsn” designating a common masculine singular noun in the nominative, have been developed for all Slavic languages, yet there have been few attempts to arrive at a proposal that would be harmonised across the languages. Standardisation adds to the interchange potential of the resources, making it easier to develop multilingual applications or t...

Publication date: 1994