corpora creation

Reference Lists for the Evaluation of Term Extraction Tools

2012

Elizaveta Loginova, Anita Gojun, Helena Blancafort, Marie Guégan, Tatiana Gornostay, Ulrich Heid

In this paper, we discuss practical and methodological issues of the creation of reference term lists (RTLs) for the evaluation of monolingual and bilingual term candidate extraction from comparable corpora in the domains of wind energy and mobile technology. These reference term lists are intended to serve as a "gold standard" for the qualitative and quantitative evaluation of automatic term e...

Phoxsy: multi-phone segments for unit selection speech synthesis

2004

Stefan Breuer, Julia Abresch

A multi-phone unit specification for unit selection speech synthesis is introduced and tested with regard to its qualitative aspects by means of a listening experiment. This different concept of unit definition aims to prevent spectral discontinuities at highly critical points of concatenation and to allow for a faster creation of speech corpora, as well as a speed-up of cost calculation and un...

WN-Toolkit: Automatic generation of WordNets following the expand model

2014

Antoni Oliver

This paper presents a set of methodologies and algorithms to create WordNets following the expand model. We explore dictionary and BabelNet based strategies, as well as methodologies based on the use of parallel corpora. Evaluation results for six languages are presented: Catalan, Spanish, French, German, Italian and Portuguese. Along with the methodologies and evaluation we present an implemen...

Automatic Community Creation for Abstractive Spoken Conversations Summarization

2017

Karan Singla, Evgeny A. Stepanov, Ali Orkan Bayer, Giuseppe Carenini, Giuseppe Riccardi

Summarization of spoken conversations is a challenging task, since it requires deep understanding of dialogs. Abstractive summarization techniques rely on linking the summary sentences to sets of original conversation sentences, i.e. communities. Unfortunately, such linking information is rarely available or requires trained annotators. We propose and experiment with automatic community creation usi...

DBpedia Abstracts: A Large-Scale, Open, Multilingual NLP Training Corpus

2016

Martin Brümmer, Milan Dojchinovski, Sebastian Hellmann

The ever-increasing importance of machine learning in Natural Language Processing is accompanied by an equally increasing need for large-scale training and evaluation corpora. Due to its size, its openness and relative quality, Wikipedia has already been a source of such data, but on a limited scale. This paper introduces the DBpedia Abstract Corpus, a large-scale, open corpus of annotated W...

Acoustical analysis of woodwind musical instruments for virtual instrument implementation by physical modeling

2004

Panagiotis Tzevelekos, Georgios Kouroupetroglou

In this paper, we present the development of a framework of methodologies that allows the acoustic analysis of corpora of woodwind musical instrument recordings, as well as the implementation of virtual instruments by physical modeling. We emphasize traditional instruments, starting with the zournas. Through analysis, acoustical aspects of the instrument are derived (attack-rel...

Exploring Reusability and Reproducibility for a Research Infrastructure for L1 and L2 Learner Corpora

Journal: Information, 2021

Up until today, research in various educational and linguistic domains such as learner corpus research, writing research or second language acquisition has produced a substantial amount of data in the form of L1 and L2 corpora. However, a multitude of individual solutions combined with domain-inherent obstacles to sharing have so far hampered the comparability, reusability and reproducibility of results. In this article, we present ...

An eRulemaking Corpus: Identifying Substantive Issues in Public Comments

2008

Claire Cardie, Cynthia Farina, Matt Rawding, Adil Aijaz

We describe the creation of a corpus that supports a real-world hierarchical text categorization task in the domain of electronic rulemaking (eRulemaking). Features of the task and of the eRulemaking domain engender both a non-traditional text categorization corpus and a correspondingly difficult machine learning task. Interannotator agreement results are presented for a group of six annotators...

Annotating Uncertainty in Hungarian Webtext

2014

Veronika Vincze, Katalin Ilona Simkó, Viktor Varga

Uncertainty detection has been a popular topic in natural language processing, which has resulted in the creation of several corpora for English. Here we show how the annotation guidelines originally developed for English standard texts can be adapted to Hungarian webtext. We annotated a small corpus of Facebook posts for uncertainty phenomena and we illustrate the main characteristics of such te...

Discovering Location Indicators of Toponyms from News to Improve Gazetteer-Based Geo-Referencing

2008

Cleber Gouvêa, Stanley Loh, Luís Fernando Fortes Garcia, Evandro Brasil da Fonseca, Igor Wendt

This paper presents an approach that identifies Location Indicators related to geographical locations by analyzing the texts of news published on the Web. The goal is to semi-automatically create Gazetteers with the identified relations and then perform geo-referencing of news. Location Indicators include non-geographical entities that are dynamic and may change over time. The use of news pub...
