Open Resources and Tools for the Shallow Processing of Portuguese: The TagShare Project

Authors

Florbela Barreto

António Branco

Eduardo Ferreira

Amália Mendes

Maria Fernanda Bacelar do Nascimento

Filipe Nunes

João Ricardo Silva

Abstract

This paper presents the TagShare project and the linguistic resources and tools for the shallow processing of Portuguese developed in its scope. These resources include a 1-million-token corpus that has been accurately hand-annotated with a variety of linguistic information, as well as several state-of-the-art shallow processing tools capable of automatically producing that type of annotation. At present, the linguistic annotations in the corpus are sentence and paragraph boundaries, token boundaries, morphosyntactic POS categories, values of inflection features, lemmas and named entities. Hence, the set of tools comprises a sentence chunker, a tokenizer, a POS tagger, nominal and verbal analyzers and lemmatizers, a verbal conjugator, a nominal “inflector”, and a named-entity recognizer, some of which underlie several online services.

Similar resources

A New Method for Improving Computational Cost of Open Information Extraction Systems Using Log-Linear Model

Information extraction (IE) is the process of automatically producing a structured representation from unstructured or semi-structured text. It is a long-standing challenge in natural language processing (NLP), made harder by the growing volume, heterogeneity, and unstructured form of information. One of the core information extraction tasks is relation extraction wh...

Open Resources and Tools for the Shallow Processing of Portuguese

This paper presents the linguistic resources and tools for the shallow processing of Portuguese developed in the scope of a research initiative at the University of Lisbon. These resources include a 1-million-token corpus that has been accurately hand-annotated with a variety of linguistic information, as well as several state-of-the-art shallow processing tools capable of automatically produc...

Real-Time Open-Domain QA on the Portuguese Web

This paper presents a system for real-time, open-domain question answering on the Web of documents written in Portuguese, prepared to handle factual questions and available as a freely accessible online service. In order to deliver candidate answers to input questions phrased in Portuguese, this system resorts to a number of shallow processing tools and question answering techniques that are sp...

A new model for mining method selection based on grey and TODIM methods

One of the most important steps involved in mining operations is to select an appropriate extraction method for mine resources. After choosing the extraction method, it is usually impossible to replace it with another one because it may be so expensive that implementation of the entire project could be economically impossible. Choosing a mining method depends on the geological and geometrical c...

UNITEX-PB, a set of flexible language resources for Brazilian Portuguese

This work documents the project and development of various computational linguistic resources that support the Brazilian Portuguese language according to the formal methodology used by the corpus processing system called UNITEX. The delivered resources include computational lexicons, libraries to access compressed lexicons, and additional tools to validate those resources.

Publication date: 2006
