Leveraging Multilingual News Websites for Building a Kurdish Parallel Corpus

Abstract

Machine translation has been a major motivation of development in natural language processing. Despite the burgeoning achievements in creating more efficient machine translation systems, thanks to deep learning methods, parallel corpora have remained indispensable for progress in the field. In an attempt to create parallel corpora for the Kurdish language, in this article we describe our approach to retrieving potentially alignable news articles from multi-language websites and manually aligning them across dialects and languages based on lexical similarity and transliteration of scripts. We present a corpus containing 12,327 translation pairs in two dialects of Kurdish, Sorani and Kurmanji. We also provide 1,797 and 650 translation pairs in English-Kurmanji and English-Sorani. The corpus is publicly available under the CC BY-NC-SA 4.0 license.

Similar resources

Building a Multilingual Parallel Subtitle Corpus

In this paper on-going work of creating an extensive multilingual parallel corpus of movie subtitles is presented. The corpus currently contains roughly 23,000 pairs of aligned subtitles covering about 2,700 movies in 29 languages. Subtitles mainly consist of transcribed speech, sometimes in a very condensed way. Insertions, deletions and paraphrases are very frequent which makes them a challen...

Comparing two acquisition systems for automatically building an English-Croatian parallel corpus from multilingual websites

In this paper we compare two tools for automatically harvesting bitexts from multilingual websites: bitextor and ILSP-FC. We used both tools for crawling 21 multilingual websites from the tourism domain to build a domain-specific English–Croatian parallel corpus. Different settings were tried for both tools and 10,662 unique document pairs were obtained. A sample of about 10% of them was manual...

Building The Sense-Tagged Multilingual Parallel Corpus

Sense-annotated parallel corpora play a crucial role in natural language processing. This paper introduces our progress in creating such a corpus for Asian languages using English as a pivot, which is the first such corpus for these languages (Chinese, Japanese and Indonesian). Two sets of tools have been developed for sequential and targeted tagging, which are also easy to be set up for any ne...

Building a multilingual parallel corpus for human users

We present the architecture and the current state of InterCorp, a multilingual parallel corpus centered around Czech, intended primarily for human users and consisting of written texts with a focus on fiction. Following an outline of its recent development and a comparison with some other multilingual parallel corpora we give an overview of the data collection procedure that covers text selecti...

Building a Parallel Multilingual Corpus (Arabic-Spanish-English)

This paper presents the results (1st phase) of the on-going research in the Computational Linguistics Laboratory at Autónoma University of Madrid (LLI-UAM) aiming at the development of a multi-lingual parallel corpus (Arabic-Spanish-English) aligned on the sentence level and tagged on the POS level. A multilingual parallel corpus which brings together Arabic, Spanish and English is a new resour...

Journal

Journal title: ACM Transactions on Asian and Low-Resource Language Information Processing

Year: 2022

ISSN: 2375-4699, 2375-4702

DOI: https://doi.org/10.1145/3511806