Mizan English-Persian Parallel Corpus

A hidden Markov model for Persian part-of-speech tagging

2011

Morteza Okhovvat Behrouz Minaei-Bidgoli

Part-of-speech tagging is one of the important tasks in language processing. Despite this importance, and although numerous models have been presented for different languages, little work has been done for Persian. In this paper, a part-of-speech tagging system for a Persian corpus using a hidden Markov model is proposed. To achieve this goal, the main aspects of Pers...

Class Based Sense Definition Model for Word Sense Tagging and Disambiguation

2003

Tracy Lin Jason S. Chang

We present an unsupervised learning strategy for word sense disambiguation (WSD) that exploits multiple linguistic resources including a parallel corpus, a bilingual machine readable dictionary, and a thesaurus. The approach is based on Class Based Sense Definition Model (CBSDM) that generates the glosses and translations for a class of word senses. The model can be applied to resolve sense amb...

Development of a Japanese-English Software Manual Parallel Corpus

2009

Tatsuya Ishisaka Kazuhide Yamamoto Masao Utiyama Eiichiro Sumita

To address the shortage of Japanese-English parallel corpora, we developed a parallel corpus by collecting open source software manuals from the Web. The constructed corpus contains approximately 500 thousand sentence pairs that were aligned automatically by an existing method. We also conducted statistical machine translation (SMT) experiments with the corpus and confirmed that the corpus is u...

Building the Croatian-English Parallel Corpus

2000

Marko Tadic

The contribution gives a survey of the procedures and formats used in building the Croatian-English parallel corpus, which is being collected at the Institute of Linguistics at the Philosophical Faculty, University of Zagreb. The primary text source is the newspaper Croatia Weekly, which has been published since the beginning of 1998 by HIKZ (Croatian Institute for Information and Culture). After quick su...

The IIT Bombay English-Hindi Parallel Corpus

Journal: CoRR, 2017

Anoop Kunchukuttan Pratik Mehta Pushpak Bhattacharyya

The IIT Bombay English-Hindi corpus contains a parallel corpus for English-Hindi compiled from a variety of existing sources as well as corpora developed at the Center for Indian Language Technology, IIT Bombay over the years. The training corpus consists of sentences, phrases as well as dictionary entries, spanning many applications and domains. The details of the training corpus are shown in T...

Building Parallel Corpora for SMT System: A Case Study of English-Manipuri

2012

Thoudam Doren Singh

Statistical Machine Translation (SMT) systems are developed using sentence-aligned parallel corpora. The difficulty is that no parallel corpus of the required size exists for many language pairs. Preparing a large-scale parallel corpus takes time and demands linguistic skill. In the present work, the various issues of a quality parallel corpus and a technique that extracts p...

IDENTIC Corpus: Morphologically Enriched Indonesian-English Parallel Corpus

2012

Septina Dian Larasati

This paper describes the creation process of an Indonesian-English parallel corpus (IDENTIC). The corpus contains 45,000 sentences collected from different sources in different genres. Several manual text preprocessing tasks, such as alignment and spelling correction, are applied to the corpus to assure its quality. We also apply language specific text processing such as tokenization on both si...

Improving SMT by Using Parallel Data of a Closely Related Language

2012

Petra Galuscáková Ondrej Bojar

The amount of training data in statistical machine translation critically affects translation quality. In this paper, we demonstrate how to increase translation quality for one language pair by introducing parallel data from a closely related language. Specifically, we improve English→Slovak translation using a large Czech–English parallel corpus and a shallow MT system for Czech→Slovak transl...

The SAWA Corpus: A Parallel Corpus English - Swahili

2009

Guy De Pauw Peter Waiganjo Wagacha Gilles-Maurice de Schryver

Research in data-driven methods for Machine Translation has greatly benefited from the increasing availability of parallel corpora. Processing the same text in two different languages yields useful information on how words and phrases are translated from a source language into a target language. To investigate this, a parallel corpus is typically aligned by linking linguistic tokens in the sour...

A Contrastive Study of Students' Persian and English Essays Based on Toulmin's Model: Examining the Effect of the Genre Approach

Thesis: Ministry of Science, Research and Technology - Payame Noor University - Payame Noor University, Central Branch - Faculty of Foreign Languages, 1391

Farzaneh Khodabandeh, Hassan Soleimani, Manouchehr Jafari Gohar, Fatemeh Hemmati

The primary goal of the current project was to examine the effect of three different treatments, namely models with explicit instruction, models with implicit instruction, and models alone, on differences between the three groups of subjects in the use of the elements of argument structures in terms of Toulmin's (2003) model (i.e., claim, data, counterargument claim, counterargument data, rebutt...
