Coping with Data-sparsity in Example-based Machine Translation
Abstract
Data-driven Machine Translation (MT) systems require large amounts of data to function well. However, obtaining parallel texts for many languages is time-consuming, expensive and difficult. This thesis aims to improve translation quality for languages with limited resources by using the available data more efficiently. To handle data sparsity, the translation model uses templates: generalizations of sentence pairs in which sequences of one or more words are replaced by variables. Templates are built from clusters, or equivalence classes, that group related terms (words and phrases). As building such clusters by hand can be time-consuming, they are generated automatically by grouping terms according to their semantic similarity, syntactic coherence and context.
Data sparsity is also a major challenge in statistical language modeling. In many MT systems, sophisticated tools are developed to improve the translation models, but the systems still rely heavily on a restricted decoder that uses unreliable language models, which may not be well suited to translation tasks, especially in sparse-data scenarios. Templates can also be used in language modeling.
Limited training data also increases the number of out-of-vocabulary (OOV) words and reduces the quality of the translations. Many present MT systems either ignore these unknown words or pass them on as is to the final translation, assuming that they could be proper nouns. The presence of out-of-vocabulary and rare words in the input sentence prevents an MT system from finding longer phrasal matches and leads to low-quality translations because the language model estimates are less reliable. Past approaches have suggested using stems and synonyms of OOV words as replacements. This thesis uses an algorithm that finds possible replacements, not necessarily synonyms, for out-of-vocabulary words as well as rare words, based on the context in which these words appear.
The effectiveness of each template-based approach, in both the translation model and the language model, is demonstrated for English→Chinese and English→French. The algorithm for handling out-of-vocabulary and rare words is tested on English→French, English→Chinese and English→Haitian. A hybrid approach combining all the techniques is also studied for English→Chinese.
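The template idea described in the abstract can be illustrated with a small sketch. It is not the thesis's implementation: the equivalence classes in `CLUSTERS`, the class labels, and the helper names `build_lookup` and `generalize` are hypothetical. The sketch covers only the source side of a sentence pair, and it assumes greedy longest-match replacement, which the abstract does not specify.

```python
from typing import Dict, List, Tuple

# Hypothetical equivalence classes: class label -> member terms (words or phrases).
# The thesis derives such clusters automatically; these are hand-written for illustration.
CLUSTERS: Dict[str, List[str]] = {
    "<CITY>": ["new york", "paris", "beijing"],
    "<WEEKDAY>": ["monday", "friday"],
}


def build_lookup(clusters: Dict[str, List[str]]) -> Dict[Tuple[str, ...], str]:
    """Map each term, as a tuple of tokens, to its class label."""
    lookup: Dict[Tuple[str, ...], str] = {}
    for label, terms in clusters.items():
        for term in terms:
            lookup[tuple(term.split())] = label
    return lookup


def generalize(sentence: str, lookup: Dict[Tuple[str, ...], str]) -> List[str]:
    """Replace the longest matching term at each position with its class label."""
    tokens = sentence.lower().split()
    max_len = max((len(key) for key in lookup), default=0)
    template: List[str] = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate phrase first, then shorter ones.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            label = lookup.get(tuple(tokens[i:i + n]))
            if label is not None:
                template.append(label)
                i += n
                break
        else:
            # No cluster member starts here; keep the word unchanged.
            template.append(tokens[i])
            i += 1
    return template


lookup = build_lookup(CLUSTERS)
print(" ".join(generalize("She flew to New York on Friday", lookup)))
```

A sentence such as "He flew to Paris on Monday" maps to the same template, so a single stored example can cover many inputs. That is how generalization is meant to ease data sparsity.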
Similar resources
A new model for persian multi-part words edition based on statistical machine translation
Multi-part words in English are hyphenated, with a hyphen separating the different parts. Persian has multi-part words as well. According to Persian morphology, a half-space character is needed to separate the parts of a multi-part word, but in many cases people incorrectly use a space character instead. This common incorrect use of space leads to some s...
Mitigation of Data Sparsity in Classifier-Based Translation
The concept classifier has been used as a translation unit in speech-to-speech translation systems. However, the sparsity of the training data is the bottleneck of its effectiveness. Here, a new method based on a statistical machine translation system is introduced to mitigate the effects of data sparsity when training classifiers. Also, the effects of the background model which is ...
Prediction of blood cancer using leukemia gene expression data and sparsity-based gene selection methods
Background: DNA microarray is a useful technology that simultaneously assesses the expression of thousands of genes. It can be utilized for the detection of cancer types and cancer biomarkers. This study aimed to predict blood cancer using leukemia gene expression data and a robust ℓ2,p-norm sparsity-based gene selection method. Materials and Methods: In this descriptive study, the microarray ...
Reducing the Impact of Data Sparsity in Statistical Machine Translation
Morphologically rich languages generally require large amounts of parallel data to adequately estimate parameters in a statistical Machine Translation (SMT) system. However, it is time-consuming and expensive to create large collections of parallel data. In this paper, we explore two strategies for circumventing sparsity caused by lack of large parallel corpora. First, we explore the use of dist...
A NOVEL FUZZY-BASED SIMILARITY MEASURE FOR COLLABORATIVE FILTERING TO ALLEVIATE THE SPARSITY PROBLEM
Memory-based collaborative filtering is the most popular approach to build recommender systems. Despite its success in many applications, it still suffers from several major limitations, including data sparsity. Sparse data affect the quality of the user similarity measurement and consequently the quality of the recommender system. In this paper, we propose a novel user similarity measure based...
Publication date: 2011