Collocation and Thai Word Segmentation

نویسنده

  • Wirote Aroonmanakun
چکیده

This paper presents another approach of Thai word segmentation, which is composed of two processes : syllable segmentation and syllable merging. Syllable segmentation is done on the basis of trigram statistics. Syllable merging is done on the basis of collocation between syllables. We argue that many of word segmentation ambiguities can be resolved at the level of syllable segmentation. Since a syllable is a more well-defined unit and more consistent in analysis than a word, this approach is more reliable than other approaches that use a wordsegmented corpus. This approach can perform well at the level of accuracy 81-98% depending on the dictionary used in the segmentation.

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Context Sensitive Pattern Based Segmentation: A Thai Challenge

A Thai written text is a string of symbols without explicit word boundary markup. A method for a development of a segmentation tool from a corpus of already segmented text is described. The methodology is based on the technology of competing patterns, evolved from algorithm for English hyphenation. A new UNICODE pattern generation program, OPATGEN, is used for the learning phase. We have shown ...

متن کامل

Estimating Word Translation Probabilities for Thai – English Machine Translation using EM Algorithm

Selecting the word translation from a set of target language words, one that conveys the correct sense of source word and makes more fluent target language output, is one of core problems in machine translation. In this paper we compare the 3 methods of estimating word translation probabilities for selecting the translation word in Thai – English Machine Translation. The 3 methods are (1) Metho...

متن کامل

A Multi-Aspect Comparison and Evaluation on Thai Word Segmentation Programs

Word segmentation is an important task in natural language processing, especially for languages without word boundaries, such as Thai language. Many Thai word segmentation programs have been developed. Researchers and developers in Thai documents usually spend a tremendous amount of time in studying and trying different Thai word segmentation programs. This paper presents the performance of six...

متن کامل

An Integrated Tool for Translation-Memory Maintenance

This paper presents an integrated tool to construct and maintain translation-memory for memory-based machine translation. This tool was aimed to automate constructing and validating translation-memory both in word and in phrase levels from English-Thai parallel texts. To align English-Thai words and phrases, the crucial problems that must be resolved include multiple-word-expression boundary am...

متن کامل

Word Segmentation for Urdu OCR System

This paper presents a technique for Word segmentation for the Urdu OCR system. Word segmentation or word tokenization is a preliminary task for understanding the meanings of sentences in Urdu language processing. Several techniques are available for word segmentation in other languages but not much work has been done for word segmentation of Urdu Optical Character Recognition (OCR) System. A me...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

برای دانلود متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2002