Mizan English-Persian parallel corpus

The AMARA Corpus: Building Resources for Translating the Web’s Educational Content

2013

Francisco Guzman, Hassan Sajjad, Stephan Vogel, Ahmed Abdelali

In this paper, we introduce a new parallel corpus of subtitles of educational videos: the AMARA corpus for online educational content. We crawl a multilingual collection of community-generated subtitles and present the results of processing the Arabic–English portion of the data, which yields a parallel corpus of about 2.6M Arabic and 3.9M English words. We explore different approaches to align t...

Metadiscourse Features in Medical Research Articles: Subdisciplinary and Paradigmatic Influences in English and Persian

Journal: Journal of Research in Applied Linguistics 2018

Ali Mohammad Fazilatfar, Hamid Allami, MohammadReza Mozayan

Disciplinary studies on metadiscourse in academic texts have come a rather long way (since the 1980s) to afford an awareness of the ways authors strive to signal their insights into their materials as well as their audience. However, few comprehensive corpus-based studies to date have provided a starting point for shaping our understanding of subdisciplinary and paradigmatic diversities within ...

The United Nations Parallel Corpus v1.0

2016

Michal Ziemski, Marcin Junczys-Dowmunt, Bruno Pouliquen

This paper describes the creation process and statistics of the official United Nations Parallel Corpus, the first parallel corpus composed from United Nations documents published by the original data creator. The parallel corpus presented consists of manually translated UN documents from the last 25 years (1990 to 2014) for the six official UN languages, Arabic, Chinese, English, French, Russi...

Exploring the SAWA corpus: collection and deployment of a parallel corpus English - Swahili

Journal: Language Resources and Evaluation 2011

Guy De Pauw, Peter Waiganjo Wagacha, Gilles-Maurice de Schryver

Research in machine translation and corpus annotation has greatly benefited from the increasing availability of word-aligned parallel corpora. This paper presents ongoing research on the development and application of the SAWA corpus, a two-million-word parallel corpus English—Swahili. We describe the data collection phase and zero in on the difficulties of finding appropriate and easily access...

Towards English - Swahili Machine Translation

2010

Guy De Pauw, Peter Waiganjo Wagacha, Gilles-Maurice de Schryver

Even though the Bantu language of Swahili is spoken by more than fifty million people in East and Central Africa, it is surprisingly resource-scarce from a language technological point of view, an unfortunate situation that holds for most, if not all languages on the continent. The increasing amount of digitally available, vernacular data has prompted researchers to investigate the applicabilit...

Evaluation of Corpus Assisted Spanish Learning

2013

Hui-Chuan Lu, Yu-Hsin Chu

In the development of corpus linguistics, the creation of corpora has had a critical role in corpus-based studies. The majority of created corpora have been associated with English and native languages, while other languages and types of corpora have received relatively less attention. Because an increasing number of corpora have been constructed, and each corpus is constructed for a definite p...

Event and Event Actor Alignment in Phrase Based Statistical Machine Translation

2013

Anup Kumar Kolya, Santanu Pal, Asif Ekbal, Sivaji Bandyopadhyay

This paper examines the impact of event and event actor alignment in an English-Bengali phrase-based Statistical Machine Translation (PB-SMT) system. Initially, events and event actors are identified from an English-Bengali parallel corpus. For event and event actor identification in English, we propose a hybrid technique carried out within the TimeML framework. Events in Bengali...

UM-Corpus: A Large English-Chinese Parallel Corpus for Statistical Machine Translation

2014

Liang Tian, Derek F. Wong, Lidia S. Chao, Paulo Quaresma, Francisco Oliveira, Lu Yi

A parallel corpus is a valuable resource for cross-language information retrieval and data-driven natural language processing systems, especially for Statistical Machine Translation (SMT). However, most existing parallel corpora for Chinese are restricted to in-house use, while others are domain-specific and limited in size. To a certain degree, this limits SMT research. This paper describes the ...

Unsupervised Part of Speech Tagging for Persian

2012

Tayebeh Mosavi Miangah, Ali Delavar Khalafi

In this paper we present a rather novel unsupervised method for part-of-speech (hereafter POS) disambiguation, which has been applied to Persian. This method, known as the Iterative Improved Feedback (IIF) Model, is a heuristic that uses only a raw corpus of Persian, together with all possible tags for every word in that corpus, as input. During the process of tagging, the algorithm passes through sever...

Using a Multimedia Parallel Corpus to Investigate English-Galician Subtitling

2011

Patricia Sotelo Dios

This paper presents an ongoing research project that involves the compilation and exploitation of the multimedia corpus of subtitled films Veiga as a method to investigate the practice of English intralingual subtitling and English-Galician interlingual subtitling. Our project draws on recent work in corpus-based translation studies and its applications in the field of audiovisual translation a...
