linguistic corpus

Studying Linguistic Changes over 200 Years of Newspapers through Resilient Words Analysis

Journal: Front. Digital Humanities 2017

Vincent Buntinx Cyril Bornet Frédéric Kaplan

This paper presents a methodology for analyzing linguistic changes in a given textual corpus that overcomes two common problems in corpus linguistics studies. One is the monotonic increase of the corpus size with time; the other is the presence of noise in the textual data. In addition, our method makes it possible to better target the linguistic evolution of the corpus,...

Speech annotation and corpus tools

Journal: Speech Communication 2001

Steven Bird Jonathan Harrington

The growth in the use of speech corpora has benefited in the last 10 years from the establishment of data centres, such as the Linguistic Data Consortium (LDC), the European Language Resources Association (ELRA), the Japanese Language Resource Consortium (GSK: Gengo Shigen Kyouyuukikou), and multi-site annotation initiatives, such as the ToBI system for prosodic annotation and the DAMSL system ...

The NIST 2010 speaker recognition evaluation

2010

Alvin F. Martin Craig S. Greenberg

The 2010 NIST Speaker Recognition Evaluation continues a series of evaluations of text independent speaker detection begun in 1996. It utilizes the newly collected Mixer-6 and Greybeard Corpora from the Linguistic Data Consortium. Major test conditions to be examined include variations in channel, speech style, vocal effort, and the effect of speaker aging over a multi-year period. A new primar...

Linguistic Resources for Handwriting Recognition and Translation Evaluation

2012

Zhiyi Song Safa Ismael Stephen Grimes David S. Doermann Stephanie Strassel

We describe efforts to create corpora to support the development and evaluation of handwriting recognition and translation technology. LDC has developed a stable pipeline and infrastructure for collecting and annotating handwriting linguistic resources to support the evaluation of MADCAT and OpenHaRT. We collect handwritten samples of pre-processed Arabic and Chinese data that has already been tra...

Degrees of Orality in Speech-like Corpora: Comparative Annotation of Chat and E-mail Corpora

2010

Eckhard Bick

This paper describes and evaluates the automatic grammatical annotation of a chat corpus and an e-mail corpus, together totaling 117 million words, using a modular Constraint Grammar system. We discuss a number of genre-specific issues, such as emoticons and personal pronouns, and offer a linguistic comparison of the two corpora with corresponding annotations of the Europarl corpus and the spoken and written...

ARRAU: Linguistically-Motivated Annotation of Anaphoric Descriptions

2016

Olga Uryupina Ron Artstein Antonella Bristot Federica Cavicchio Kepa Joseba Rodríguez Massimo Poesio

This paper presents a second release of the ARRAU dataset: a multi-domain corpus with thorough linguistically motivated annotation of anaphora and related phenomena. Building upon the first release almost a decade ago, a considerable effort has been invested in improving the data both quantitatively and qualitatively. Thus, we have doubled the corpus size, expanded the selection of covered phen...

Digital Museum of Greek Oral History: How Dialectal Speech Corpora Remain Vivid in Class

2014

A. Sfakianaki

Dialectal variants are complete linguistic systems just like standard languages (cf. Kontosopoulos 1997, Ntinas & Zarkogianni 2009). The teaching of different linguistic varieties of a standard language gives pupils the opportunity a) to become acquainted with the treasures of the expressive means of their mother language, b) to embody the mother language in a broader cultural and historical contex...

The MMSR bilingual and crosschannel corpora for speaker recognition research and evaluation

2004

Joseph P. Campbell Hirotaka Nakasone Christopher Cieri David Miller Kevin Walker Alvin F. Martin Mark A. Przybocki

We describe efforts to create corpora to support and evaluate systems that meet the challenge of speaker recognition in the face of both channel and language variation. In addition to addressing ongoing evaluation of speaker recognition systems, these corpora are aimed at the bilingual and crosschannel dimensions. We report on specific data collection efforts at the Linguistic Data Consortium, ...

SemEval-2007 Task 11: English Lexical Sample Task via English-Chinese Parallel Text

2007

Hwee Tou Ng Yee Seng Chan

We made use of parallel texts to gather training and test examples for the English lexical sample task. Two tracks were organized for our task. The first track used examples gathered from an LDC corpus, while the second track used examples gathered from a Web corpus. In this paper, we describe the process of gathering examples from the parallel corpora, the differences with similar tasks in pre...

Parallel Aligned Treebanks at LDC: New Challenges Interfacing Existing Infrastructures

2012

Xuansong Li Stephanie Strassel Stephen Grimes Safa Ismael Mohamed Maamouri Ann Bies Nianwen Xue

Parallel aligned treebanks (PAT) are linguistic corpora annotated with morphological and syntactic structures that are aligned at sentence as well as sub-sentence levels. They are valuable resources for improving machine translation (MT) quality. Recently, there has been an increasing demand for such data, especially for divergent language pairs. The Linguistic Data Consortium (LDC) and its aca...
