corpora creation

Corpora Amylacea

Journal: Journal of Neuropathology and Experimental Neurology 1996

Creating Appropriate Corpus for Information Retrieval and Natural Language Processing in Persian Language

Journal: International Journal of Information Science and Management

Zahra Abdolhosseini, Department of Computer Engineering, Alzahra University, Tehran, Iran; Mohammad Reza Keyvanpour, Department of Computer Engineering, Alzahra University, Tehran, Iran

Persian natural language processing (NLP) researchers have limited access to linguistic tools suitable for text processing. Therefore, research in Persian text processing is very limited. Since a dataset is an important requirement for experiments and their evaluation, we aimed to create appropriate corpora for information retrieval and natural language processing in Persian. th...


Creation and Analysis of a Reading Comprehension Exercise Corpus: Towards Evaluating Meaning in Context

2012

Niels Ott, Ramon Ziai

We discuss the collection and analysis of a cross-sectional and longitudinal learner corpus consisting of answers to reading comprehension questions written by adult second language learners of German. We motivate the need for such task-based learner corpora and identify the properties which make reading comprehension exercises a particularly interesting task. In terms of the creation of the co...


IRCAM Corpus Tools: Managing speech corpora

Journal: TAL 2008

Grégory Beller, Christophe Veaux, Gilles Degottex, Nicolas Obin, Pierre Lanchantin, Xavier Rodet

Corpus based methods are increasingly used for speech technology applications and for the development of theoretical or computer models of spoken languages. These usages range from unit selection speech synthesis to statistical modeling of speech phenomena like prosody or expressivity. In all cases, these usages require a wide range of tools for corpus creation, labeling, symbolic and acoustic ...


The MILE Corpus for Less Commonly Taught Languages

2006

Alison Alvarez, Lori S. Levin, Robert E. Frederking, Simon Fung, Donna Gates, Jeff Good

This paper describes a small, structured English corpus that is designed for translation into Less Commonly Taught Languages (LCTLs), and a set of re-usable tools for creation of similar corpora. The corpus systematically explores meanings that are known to affect morphology or syntax in the world’s languages. Each sentence is associated with a feature structure showing the elements of meanin...


Statistical Machine Translation with a Small Amount of Bilingual Training Data

1974

Maja Popović, Hermann Ney

The performance of a statistical machine translation system depends on the size of the available task-specific bilingual training corpus. On the other hand, acquisition of a large high-quality bilingual parallel text for the desired domain and language pair requires a lot of time and effort, and, for some language pairs, is not even possible. Besides, small corpora have certain advantages like ...


Towards a reference corpus of web genres

2007

Marina Santini, Serge Sharoff, David Lee

Genres of spoken and written texts are being intensively studied from various angles, e.g., communication studies, discourse analysis, computational linguistics, without arriving at a generally accepted definition. Many corpora have been built to represent the language, but very few large corpora indicate genres, and when they do the typology of genres varies widely. For instance, the Brown cor...


Improving term extraction with linguistic analysis in the biomedical domain

Journal: Research in Computing Science 2013

Wiktoria Golik, Robert Bossy, Zorana Ratkovic, Claire Nedellec

This paper presents a linguistic-based approach to term extraction from corpora in the biomedical domain. The method is based on an analysis of terms and their context that verify linguistic constraints. It focuses on participles and prepositional complements. The purpose of our approach is to obtain terms that are relevant for knowledge acquisition applications, such as the creation and enrich...


Creation and Validation of Large Lexica for Speech-to-Speech Translation Purposes

2004

Hanne Fersøe, Elviira Hartikainen, Henk van den Heuvel, Giulio Maltese, Asunción Moreno, Shaunie Shammass, Ute Ziegenhain

This paper presents specifications and requirements for creation and validation of large lexica that are needed in Automatic Speech Recognition (ASR), Text-to-Speech (TTS) and statistical Speech-to-Speech Translation (SST) systems. The prepared language resources are created and validated within the scope of the EU-project LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Component...


Online Inference for Relation Extraction with a Reduced Feature Set

Journal: CoRR 2015

Maxim Rabinovich, Cédric Archambeau

Access to web-scale corpora is gradually bringing robust automatic knowledge base creation and extension within reach. To exploit these large unannotated—and extremely difficult to annotate—corpora, unsupervised machine learning methods are required. Probabilistic models of text have recently found some success as such a tool, but scalability remains an obstacle in their application, with stand...
