Search results for: corpora creation

Number of results: 147847

2016
Max Grüntgens Torsten Schrade

The paper discusses the inherent potential of the Semantic Web and its related technologies for humanities research. The focal point lies on the extraction of semantic relations from heterogeneous XML based scholarly corpora using a webservice based infrastructure (XTriples). Especially the creation of methodologically distinct semantic corpora stemming from data sets originating in the humanit...

2009
Budiono Hammam Riza Chairil Hakim

Parallel text is one of the most valuable resources for the development of statistical machine translation systems and other NLP applications. However, manual translations are very costly, and the amount of known parallel text is limited. Hence, our research started with creating and collecting a large amount of parallel text resources for Indonesian-English. We describe in this paper the creation ...

2014
Xavier Tannier

Web pages do not offer reliable metadata concerning their creation date and time. However, obtaining the document creation time is a necessary step before temporal normalization systems can be applied to web pages. In this paper, we present DCTFinder, a system that parses a web page and extracts from its content the title and the creation date of this web page. DCTFinder combines heuristic title ...

Journal: Procesamiento del Lenguaje Natural 2003
Maximilian Bisani Antonio Bonafonte Núria Castell Elviira Hartikainen Giulio Maltese Asunción Moreno Shaunie Shammass Ute Ziegenhain

The objective of the EU-project LC-STAR (Lexica and Corpora for Speech-to-Speech Translation Components) is corpora collection and lexica creation for the purposes of Automatic Speech Recognition (ASR) and Text-to-speech (TTS) that are needed in speech-to-speech translation (SST). During the lifetime of the project (2002-2005) these lexica will be specified, built and validated. Large lexica co...

2003
David Conejero Jesús Giménez Victoria Arranz Antonio Bonafonte Neus Pascual Núria Castell Asunción Moreno

Creation of lexica and corpora for Catalan, Spanish and US-English is described. A lexicon is being created for speech recognition and synthesis including relevant information. The lexicon contains 50K common words selected to achieve wide coverage of the chosen domains, and 50K additional entries including special application words and proper nouns. Furthermore, a large trilingual spontaneo...

2013
Roman Grundkiewicz

There are no large error corpora for a number of languages, despite the fact that they have multiple applications in natural language processing. The main reason underlying this situation is the high cost of manual corpus creation. In this paper we present the methods of automatic extraction of various kinds of errors such as spelling, typographical, grammatical, syntactic, semantic, and stylist...

2010
Eniko Héja

This paper describes an approach based on word alignment on parallel corpora, which aims at facilitating the lexicographic work of dictionary building. Although this method has been widely used in the MT community for at least 16 years, as far as we know, it has not been applied to facilitate the creation of bilingual dictionaries for human use. The proposed corpus-driven technique, in particul...

2012
Roger Granada Lucelene Lopes Carlos Ramisch Cassia Trojahn Renata Vieira Aline Villavicencio

In this paper we present a methodology for building comparable corpora, using multilingual ontologies of a specific domain. This resource can be exploited to foster research on multilingual corpus-based ontology learning, population and matching. The resource-building process is exemplified by the construction of annotated comparable corpora in English, Portuguese, and French. The corpora, from...

2004
Christopher Cieri Kazuaki Maeda

Advances in speech technologies increase demand for linguistic data in more languages with more sophisticated annotation. In speech recognition research, corpora of dozens of broadcast hours, hundreds of conversations and tens of millions of words of text are replaced by thousands of broadcast hours, tens of thousands of conversations and billions of words of text. The scope, scale and schedule of...
