Search results for: text linguistic
Number of results: 208900
This paper describes our word segmentation system and named entity recognition (NER) system for participating in the third SIGHAN Bakeoff. Both of them are based on character tagging, but use different tag sets and different features. Evaluation results show that our word segmentation system achieved 93.3% and 94.7% F-score in UPUC and MSRA open tests, and our NER system got 70.84% and 81.32% F...
This paper describes ongoing efforts at Linguistic Data Consortium to create shared resources for improved speech-to-text technology. Under the DARPA EARS program, technology providers are charged with creating STT systems whose outputs are substantially richer and much more accurate than is currently possible. These aggressive program goals motivate new approaches to corpus creation and distrib...
Persian natural language processing (NLP) researchers face many limitations in accessing linguistic tools suitable for text processing. Therefore, research in Persian text processing is very limited. Since datasets are an important requirement for experiments and their evaluation, we aimed to create appropriate corpora for information retrieval and natural language processing in Persian. Th...
Emdros is a text database engine for linguistic analysis or annotation of text. It is applicable especially in corpus linguistics for storing and retrieving linguistic analyses of text, at any linguistic level. Emdros implements the EMdF text database model and the MQL query language. In this paper, I present both, and give an example of how Emdros can be useful in computational linguistics.
Statistical Machine Translation (MT) systems have achieved impressive results in recent years, due in large part to the increasing availability of parallel text for system training and development. This paper describes recent efforts at Linguistic Data Consortium to create linguistic resources for MT, including corpora, specifications and resource infrastructure. We review LDC's three-pronged a...
The growth in the use of speech corpora has benefited in the last 10 years from the establishment of data centres, such as the Linguistic Data Consortium (LDC), the European Language Resources Association (ELRA), the Japanese Language Resource Consortium (GSK: Gengo Shigen Kyouyuukikou), and multi-site annotation initiatives, such as the ToBI system for prosodic annotation and the DAMSL system ...
The 2010 NIST Speaker Recognition Evaluation continues a series of evaluations of text-independent speaker detection begun in 1996. It utilizes the newly collected Mixer-6 and Greybeard Corpora from the Linguistic Data Consortium. Major test conditions to be examined include variations in channel, speech style, vocal effort, and the effect of speaker aging over a multi-year period. A new primar...
We describe efforts to create corpora to support development and evaluation of handwriting recognition and translation technology. LDC has developed a stable pipeline and infrastructures for collecting and annotating handwriting linguistic resources to support the evaluation of MADCAT and OpenHaRT. We collect handwritten samples of pre-processed Arabic and Chinese data that has been already tra...
This paper presents a second release of the ARRAU dataset: a multi-domain corpus with thorough linguistically motivated annotation of anaphora and related phenomena. Building upon the first release almost a decade ago, a considerable effort had been invested in improving the data both quantitatively and qualitatively. Thus, we have doubled the corpus size, expanded the selection of covered phen...