Mizan English-Persian Parallel Corpus

Sixth International Joint Conference on Natural Language Processing: Proceedings of the 11th Workshop on Asian Language Resources

2013

Laxmi Kashyap, Malhar Kulkarni

Enriching Slovene WordNet with domain-specific terms

2011

Špela Vintar, Darja Fišer

The paper describes an innovative approach to expanding the domain coverage of wordnet by exploiting multiple resources. In the experiment described here we are using a large monolingual Slovene corpus of texts from the domain of informatics to harvest terminology from, and a parallel English-Slovene corpus and an online dictionary as bilingual resources to facilitate the mapping of terms to th...

EVBCorpus - A Multi-Layer English-Vietnamese Bilingual Corpus for Studying Tasks in Comparative Linguistics

2013

Quoc Hung Ngo, Werner Winiwarter, Bartholomäus Wloka

Bilingual corpora play an important role as resources not only for machine translation research and development but also for studying tasks in comparative linguistics. Manual annotation of word alignments is of significance to provide a gold-standard for developing and evaluating machine translation models and comparative linguistics tasks. This paper presents research on building an English-Vi...

Persian Wordnet Construction using Supervised Learning

Journal: CoRR, 2017

Zahra Mousavi, Heshaam Faili

This paper presents an automated supervised method for Persian wordnet construction. Using a Persian corpus and a bi-lingual dictionary, the initial links between Persian words and Princeton WordNet synsets have been generated. These links will be discriminated later as correct or incorrect by employing seven features in a trained classification system. The whole method is just a classification...

Treebanks in Machine Translation

2003

Martin Čmejrek, Jan Cuřín, Jiří Havelka

We present an approach using treebanks in machine translation. Our experiment in Czech-English machine translation is an attempt to develop a full machine translation system based on dependency trees (Dependency Based Machine Translation, DBMT). We use the following resources: Prague Dependency Treebank, a newly created Czech-English parallel corpus of Penn Treebank, English monolingual corpus,...

Data Collection and IPR in Multilingual Parallel Corpora. Dutch Parallel Corpus

2010

Orphée De Clercq, Maribel Montero Perez

After three years of work the Dutch Parallel Corpus (DPC) project has reached an end. The finalized corpus is a ten-million-word high-quality sentence-aligned bidirectional parallel corpus of Dutch, English and French, with Dutch as central language. In this paper we present the corpus and try to formulate some basic data collection principles, based on the work that was carried out for the pro...

Quantitative Methods in Corpus-Based Translation Studies

Journal: LLC, 2014

Sara Laviosa

Firstly, Lidun Hareide and Knut Hofland describe through practical advice the compilation process of The Norwegian Spanish Parallel Corpus (NSPC) created at the University of Bergen (Norway), as well as preliminary findings from ongoing and planned research based on it. The corpus is primarily constructed for research in Translation Studies, and is built to be roughly comparable to the Spanish-...

An Efficient Framework for Extracting Parallel Sentences from Non-Parallel Corpora

Journal: Fundam. Inform., 2014

Cuong Hoang, Anh-Cuong Le, Phuong-Thai Nguyen, Son Bao Pham, Tu-Bao Ho

Automatically building a large bilingual corpus that contains millions of words is always a challenging task. For low-resource languages in particular, it is difficult to find an existing parallel corpus that is large enough for building a real statistical machine translation system. However, comparable non-parallel corpora are richly available on the Internet, such as in Wikipedia...

A Fully Unsupervised Approach for Mining Parallel Data from Comparable Corpora

2010

Do Thi Ngoc Diep, Laurent Besacier, Eric Castelli

This paper presents an unsupervised method for extracting parallel sentence pairs from a comparable corpus. A translation system is used to mine the comparable corpus and to detect parallel sentence pairs. An iterative process is implemented not only to increase the number of extracted parallel sentence pairs but also to improve the overall quality of the translation system. A comparison betwee...

Implementing a BNC-Compare-able Web Corpus

2007

William H. Fletcher

This paper details the author’s plans for and progress with compiling and analyzing a new gigaword English corpus from the web to complement his BNC-based online database “Phrases in English”. This new corpus represents the principal English-speaking countries in proportion to their population and will be linguistically annotated with the CLAWS4 tagger using a PoS-tagset comparable to those of ...
