text linguistic

Chinese Word Segmentation and Named Entity Recognition by Character Tagging

2006

Kun Yu Sadao Kurohashi Hao Liu Toshiaki Nakazawa

This paper describes our word segmentation system and named entity recognition (NER) system for participating in the third SIGHAN Bakeoff. Both of them are based on character tagging, but use different tag sets and different features. Evaluation results show that our word segmentation system achieved 93.3% and 94.7% F-score in UPUC and MSRA open tests, and our NER system got 70.84% and 81.32% F...


Shared resources for robust speech-to-text technology

2003

Stephanie Strassel David Miller Kevin Walker Christopher Cieri

This paper describes ongoing efforts at Linguistic Data Consortium to create shared resources for improved speech-to-text technology. Under the DARPA EARS program, technology providers are charged with creating STT systems whose outputs are substantially richer and much more accurate than is currently possible. These aggressive program goals motivate new approaches to corpus creation and distrib...


Creating Appropriate Corpus for Information Retrieval and Natural Language Processing in Persian Language

Journal: International Journal of Information Science and Management

Zahra Abdolhosseini (Department of Computer Engineering, Alzahra University, Tehran, Iran) Mohammad Reza Keyvanpour (Department of Computer Engineering, Alzahra University, Tehran, Iran)

Persian natural language processing (NLP) researchers face many limitations in accessing linguistic tools suitable for text processing. Research in Persian text processing is therefore very limited. Since datasets are an important requirement for experiments and their evaluation, we aimed to create appropriate corpora for information retrieval and natural language processing in Persian. th...


Emdros - a text database engine for analyzed or annotated text

2004

Ulrik Sandborg-Petersen

Emdros is a text database engine for linguistic analysis or annotation of text. It is applicable especially in corpus linguistics for storing and retrieving linguistic analyses of text, at any linguistic level. Emdros implements the EMdF text database model and the MQL query language. In this paper, I present both, and give an example of how Emdros can be useful in computational linguistics.


Enhanced Infrastructure for Creation and Collection of Translation Resources

2010

Zhiyi Song Stephanie Strassel Gary Krug Kazuaki Maeda

Statistical Machine Translation (MT) systems have achieved impressive results in recent years, due in large part to the increasing availability of parallel text for system training and development. This paper describes recent efforts at Linguistic Data Consortium to create linguistic resources for MT, including corpora, specifications and resource infrastructure. We review LDC's three-pronged a...


Speech annotation and corpus tools

Journal: Speech Communication 2001

Steven Bird Jonathan Harrington

The growth in the use of speech corpora has benefited in the last 10 years from the establishment of data centres, such as the Linguistic Data Consortium (LDC), the European Language Resources Association (ELRA), the Japanese Language Resource Consortium (GSK: Gengo Shigen Kyouyuukikou), and multi-site annotation initiatives, such as the ToBI system for prosodic annotation and the DAMSL system ...


The NIST 2010 speaker recognition evaluation

2010

Alvin F. Martin Craig S. Greenberg

The 2010 NIST Speaker Recognition Evaluation continues a series of evaluations of text independent speaker detection begun in 1996. It utilizes the newly collected Mixer-6 and Greybeard Corpora from the Linguistic Data Consortium. Major test conditions to be examined include variations in channel, speech style, vocal effort, and the effect of speaker aging over a multi-year period. A new primar...


Linguistic Resources for Handwriting Recognition and Translation Evaluation

2012

Zhiyi Song Safa Ismael Stephen Grimes David S. Doermann Stephanie Strassel

We describe efforts to create corpora to support development and evaluation of handwriting recognition and translation technology. LDC has developed a stable pipeline and infrastructures for collecting and annotating handwriting linguistic resources to support the evaluation of MADCAT and OpenHaRT. We collect handwritten samples of pre-processed Arabic and Chinese data that has been already tra...


Strength of linguistic text evidence: A fused forensic text comparison system

Journal: Forensic Science International 2017


ARRAU: Linguistically-Motivated Annotation of Anaphoric Descriptions

2016

Olga Uryupina Ron Artstein Antonella Bristot Federica Cavicchio Kepa Joseba Rodríguez Massimo Poesio

This paper presents a second release of the ARRAU dataset: a multi-domain corpus with thorough linguistically motivated annotation of anaphora and related phenomena. Building upon the first release almost a decade ago, a considerable effort had been invested in improving the data both quantitatively and qualitatively. Thus, we have doubled the corpus size, expanded the selection of covered phen...
