LORELEI Language Packs: Data, Tools, and Resources for Technology Development in Low Resource Languages

Authors

  • Stephanie Strassel
  • Jennifer Tracey
Abstract

In this paper, we describe the textual linguistic resources in nearly three dozen languages being produced by the Linguistic Data Consortium for DARPA’s LORELEI (Low Resource Languages for Emergent Incidents) Program. The goal of LORELEI is to improve the performance of human language technologies for low-resource languages and enable rapid re-training of such technologies for new languages, with a focus on the use case of deployment of resources in sudden emergencies such as natural disasters. Representative languages have been selected to provide broad typological coverage for training, and surprise incident languages for testing will be selected over the course of the program. Our approach treats the full set of language packs as a coherent whole, maintaining LORELEI-wide specifications, tag sets and guidelines, while allowing for adaptation to the specific needs created by each language. Each representative language corpus, therefore, both stands on its own as a resource for the specific language and forms part of a large multilingual resource for broader cross-language technology development.

Similar resources

Situational Awareness for Low Resource Languages: the LORELEI Situation Frame Annotation Task

The objective of the LORELEI Situation Frame task is to aggregate information from multiple data streams – including social media – into a comprehensive, actionable understanding of the basic facts needed to mount a response to an emerging situation. Rather than evaluating these capabilities in English, LORELEI is particularly concerned with advancing human language technology performance for l...

Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED

Recently there has been increased interest in Automatic Speech Recognition (ASR) and Key Word Spotting (KWS) systems for low resource languages. One of the driving forces for this research direction is the IARPA Babel project. This paper describes some of the research funded by this project at Cambridge University, as part of the Lorelei team co-ordinated by IBM. A range of topics are discussed...

The JHU Speech LOREHLT 2017 System: Cross-Language Transfer for Situation-Frame Detection

We describe the system our team used during NIST’s LoReHLT (Low Resource Human Language Technologies) 2017 Evaluations, which evaluated document topic classification. We present a language agnostic approach combining universal acoustic modeling, evaluation-language-to-English machine translation (MT) and an English-language topic classifier. This combination requires no transcribed speech in th...

Translation Technology Tools and Professional Translators’ Attitudes toward Them

Today technology is an integral part of professional translation; and it is generally assumed that translators’ attitudes toward translation technology tools influence their interaction with technology (Bundgaard, 2017). Therefore, the present two-phase study seeks to shed some light on what translation technology tools are and how professional translators feel toward them. The research method ...

Language Resource Creation and Distribution at the Linguistic Data Consortium: A Progress Report

Changes in the supply of and demand for language resources continue to affect the role of large data centers such as the Linguistic Data Consortium (LDC) and European Language Resource Center (ELRA) within the research communities they serve. The past few years have seen increased demand for: intensively multi-modal resources, larger data sets in high-density languages and new data in low dens...

Publication date: 2016