A Maximum Entropy Approach to Chinese Word Segmentation

نویسندگان

  • Jin Kiat Low
  • Hwee Tou Ng
  • Wenyuan Guo
چکیده

We participated in the Second International Chinese Word Segmentation Bakeoff. Specifically, we evaluated our Chinese word segmenter in the open track, on all four corpora, namely Academia Sinica (AS), City University of Hong Kong (CITYU), Microsoft Research (MSR), and Peking University (PKU). Based on a maximum entropy approach, our word segmenter achieved the highest F measure for AS, CITYU, and PKU, and the second highest for MSR. We found that the use of an external dictionary and additional training corpora of different segmentation standards helped to further improve segmentation accuracy. 1 Chinese Word Segmenter The Chinese word segmenter we built is similar to the maximum entropy word segmenter we employed in our previous work (Ng and Low, 2004). Our word segmenter uses a maximum entropy framework (Ratnaparkhi, 1998; Xue and Shen, 2003) and is trained on manually segmented sentences. It classifies each Chinese character given the features derived from its surrounding context. Each Chinese character can be assigned one of four possible boundary tags: s for a character that occurs as a single-character word, b for a character that begins a multi-character (i.e., two or more characters) word, e for a character that ends a multi-character word, and m for a character that is neither the first nor last in a multi-character word. Our implementation used the opennlp maximum entropy package v2.1.0 from sourceforge.1

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

Chinese Word Segmentation Based On Direct Maximum Entropy Model

Chinese word segmentation is a fundamental and important issue in Chinese information processing. In order to find a unified approach for Chinese word segmentation, the author develop a Chinese lexical analyzer PCWS using direct maximum entropy model. The paper presents the general description of PCWS, as well as the result and analysis of its performance at the Second International Chinese Wor...

متن کامل

Chinese Word Boundaries Detection Based on Maximum Entropy Model

Among the language texts in natural language, Chinese texts are written in a continuous way with ideographic characters. Unlike other western language texts such as English, Portuguese, etc., delimiters are used to specify the word boundaries. Hence, for any Chinese information processing system such as automatic question and answering, web information retrieval, text to speech conversion, mach...

متن کامل

Chinese Word Segmentation Based on an Approach of Maximum Entropy Modeling

In this paper, we described our Chinese word segmentation system for the 3rd SIGHAN Chinese Language Processing Bakeoff Word Segmentation Task. Our system deal with the Chinese character sequence by using the Maximum Entropy model, which is fully automatically generated from the training data by analyzing the character sequences from the training corpus. We analyze its performance on both close...

متن کامل

A New Approach for English-Chinese Named Entity Alignment

Traditional word alignment approaches cannot come up with satisfactory results for Named Entities. In this paper, we propose a novel approach using a maximum entropy model for named entity alignment. To ease the training of the maximum entropy model, bootstrapping is used to help supervised learning. Unlike previous work reported in the literature, our work conducts bilingual Named Entity align...

متن کامل

A Maximum Entropy Chinese Character-Based Parser

The paper presents a maximum entropy Chinese character-based parser trained on the Chinese Treebank (“CTB” henceforth). Word-based parse trees in CTB are first converted into characterbased trees, where word-level part-ofspeech (POS) tags become constituent labels and character-level tags are derived from word-level POS tags. A maximum entropy parser is then trained on the character-based corpu...

متن کامل

ذخیره در منابع من


  با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

دوره   شماره 

صفحات  -

تاریخ انتشار 2005