Search results for: linguistic corpus

Number of results: 113027

Journal: Front. Digital Humanities 2017
Vincent Buntinx Cyril Bornet Frédéric Kaplan

This paper presents a methodology for analyzing linguistic changes in a given textual corpus that overcomes two common problems in corpus linguistics studies. One of these issues is the monotonic increase of the corpus size with time, and the other is the presence of noise in the textual data. In addition, our method allows better targeting of the linguistic evolution of the corpus,...

Journal: Speech Communication 2001
Steven Bird Jonathan Harrington

The growth in the use of speech corpora has benefited in the last 10 years from the establishment of data centres, such as the Linguistic Data Consortium (LDC), the European Language Resources Association (ELRA), the Japanese Language Resource Consortium (GSK: Gengo Shigen Kyouyuukikou), and multi-site annotation initiatives, such as the ToBI system for prosodic annotation and the DAMSL system ...

2010
Alvin F. Martin Craig S. Greenberg

The 2010 NIST Speaker Recognition Evaluation continues a series of evaluations of text-independent speaker detection begun in 1996. It utilizes the newly collected Mixer-6 and Greybeard Corpora from the Linguistic Data Consortium. Major test conditions to be examined include variations in channel, speech style, vocal effort, and the effect of speaker aging over a multi-year period. A new primar...

2012
Zhiyi Song Safa Ismael Stephen Grimes David S. Doermann Stephanie Strassel

We describe efforts to create corpora to support development and evaluation of handwriting recognition and translation technology. LDC has developed a stable pipeline and infrastructures for collecting and annotating handwriting linguistic resources to support the evaluation of MADCAT and OpenHaRT. We collect handwritten samples of pre-processed Arabic and Chinese data that has been already tra...

2010
Eckhard Bick

This paper describes and evaluates the automatic grammatical annotation of a chat corpus and an e-mail corpus totaling 117 million words, using a modular Constraint Grammar system. We discuss a number of genre-specific issues, such as emoticons and personal pronouns, and offer a linguistic comparison of the two corpora with corresponding annotations of the Europarl corpus and the spoken and written...

2016
Olga Uryupina Ron Artstein Antonella Bristot Federica Cavicchio Kepa Joseba Rodríguez Massimo Poesio

This paper presents a second release of the ARRAU dataset: a multi-domain corpus with thorough linguistically motivated annotation of anaphora and related phenomena. Building upon the first release almost a decade ago, a considerable effort has been invested in improving the data both quantitatively and qualitatively. Thus, we have doubled the corpus size, expanded the selection of covered phen...

2014
A. Sfakianaki

Dialectal variants are complete linguistic systems just like standard languages (cf. Kontosopoulos 1997, Ntinas & Zarkogianni 2009). The teaching of different linguistic varieties of a standard language gives pupils the possibility a) to be acquainted with the treasures of the expressive means of their mother language, b) to embody the mother language in a broader cultural and historical contex...

2004
Joseph P. Campbell Hirotaka Nakasone Christopher Cieri David Miller Kevin Walker Alvin F. Martin Mark A. Przybocki

We describe efforts to create corpora to support and evaluate systems that meet the challenge of speaker recognition in the face of both channel and language variation. In addition to addressing ongoing evaluation of speaker recognition systems, these corpora are aimed at the bilingual and cross-channel dimensions. We report on specific data collection efforts at the Linguistic Data Consortium, ...

2007
Hwee Tou Ng Yee Seng Chan

We made use of parallel texts to gather training and test examples for the English lexical sample task. Two tracks were organized for our task. The first track used examples gathered from an LDC corpus, while the second track used examples gathered from a Web corpus. In this paper, we describe the process of gathering examples from the parallel corpora, the differences with similar tasks in pre...

2012
Xuansong Li Stephanie Strassel Stephen Grimes Safa Ismael Mohamed Maamouri Ann Bies Nianwen Xue

Parallel aligned treebanks (PAT) are linguistic corpora annotated with morphological and syntactic structures that are aligned at sentence as well as sub-sentence levels. They are valuable resources for improving machine translation (MT) quality. Recently, there has been an increasing demand for such data, especially for divergent language pairs. The Linguistic Data Consortium (LDC) and its aca...
