Unsupervised Training for Overlapping Ambiguity Resolution in Chinese Word Segmentation

Authors

Mu Li

Jianfeng Gao

Changning Huang

Jianfeng Li

Abstract

This paper proposes an unsupervised training approach to resolving overlapping ambiguities in Chinese word segmentation. We present an ensemble of adapted Naïve Bayesian classifiers that can be trained using an unlabelled Chinese text corpus. These classifiers differ in that they use context words within windows of different sizes as features. The performance of our approach is evaluated on a manually annotated test set. Experimental results show that the proposed approach achieves an accuracy of 94.3%, rivaling the rule-based and supervised training methods.
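The ensemble idea in the abstract, several Naïve Bayes classifiers that each see context words within a differently sized window, can be illustrated with a minimal sketch. This is not the authors' implementation. The class `WindowNB`, the function `ensemble_predict`, the add-one smoothing, and the rule of summing log scores across classifiers are all assumptions made for illustration, since the abstract does not specify them. The paper trains on unlabelled text; the sketch instead uses a tiny hand-labelled toy set with placeholder tokens, purely to show how window size changes the features each classifier sees.

```python
import math
from collections import Counter, defaultdict


class WindowNB:
    """Naive Bayes over context words within +/- `window` tokens (hypothetical helper)."""

    def __init__(self, window, alpha=1.0):
        self.window = window          # number of context words taken on each side (>= 1)
        self.alpha = alpha            # add-alpha smoothing, an assumption of this sketch
        self.class_counts = Counter()
        self.word_counts = defaultdict(Counter)
        self.vocab = set()

    def features(self, left, right):
        # Context words nearest to the ambiguous string, on both sides.
        return left[-self.window:] + right[:self.window]

    def fit(self, examples):
        for left, right, label in examples:
            self.class_counts[label] += 1
            for word in self.features(left, right):
                self.word_counts[label][word] += 1
                self.vocab.add(word)
        return self

    def log_scores(self, left, right):
        total = sum(self.class_counts.values())
        scores = {}
        for label, count in self.class_counts.items():
            denom = sum(self.word_counts[label].values()) + self.alpha * len(self.vocab)
            score = math.log(count / total)
            for word in self.features(left, right):
                score += math.log((self.word_counts[label][word] + self.alpha) / denom)
            scores[label] = score
        return scores


def ensemble_predict(classifiers, left, right):
    # Assumed combination rule: sum each classifier's log scores per label.
    combined = defaultdict(float)
    for clf in classifiers:
        for label, score in clf.log_scores(left, right).items():
            combined[label] += score
    return max(combined, key=combined.get)


# Toy data: an overlapping string "ABC" can be cut as "AB|C" or "A|BC".
# Context tokens are made-up placeholders, not taken from the paper.
examples = [
    (["x", "gov"], ["firm", "y"], "A|BC"),
    (["x", "gov"], ["asset", "z"], "A|BC"),
    (["all", "world"], ["own", "law"], "AB|C"),
    (["all", "world"], ["have", "law"], "AB|C"),
]
classifiers = [WindowNB(window=w).fit(examples) for w in (1, 2)]
prediction = ensemble_predict(classifiers, ["q", "world"], ["have", "r"])
```

Each classifier is trained on the same examples but extracts a different feature set, so a narrow window captures only the immediately adjacent words while a wider one adds more distant context.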

Similar resources

Ambiguity Resolution in Chinese Word Segmentation

A new method for Chinese word segmentation named Conditional F&BMM (Forward and Backward Maximal Matching) which incorporates both bigram statistics (i.e., mutual information and difference of t-test between Chinese characters) and linguistic rules for ambiguity resolution is proposed in this paper. The key characteristics of this model are the use of: (i) statistics which can be automatically ...

Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation

Overlapping ambiguity is a major ambiguity type in Chinese word segmentation. In this paper, the statistical properties of overlapping ambiguities are intensively studied based on the observations from a very large balanced general-purpose Chinese corpus. The relevant statistics are given from different perspectives. The stability of high frequent maximal overlapping ambiguities is tested based...

Covering Ambiguity Resolution in Chinese Word Segmentation Based on Contextual Information

Covering ambiguity is one of the two basic types of ambiguities in Chinese word segmentation. We regard its resolution as equivalent to word sense disambiguation, and make use of the classical vector space model in information retrieval to formulate the contexts of ambiguous words. A variation form of TFIDF weighting is proposed and a Chinese thesaurus is additionally utilized to cope with data...

An Improvement Method for The Ambiguous Fragments Discovery in Chinese Word Segmentation

Disambiguation is a difficult task in Chinese Automatic Word Segmentation, and the ambiguous fragments discovery is the foundation of the disambiguation. This article proposes a method named Bidirectional Maximum Matching and Retroversion Multiword to discover the ambiguous fragments, which can deal with the overlapping ambiguity fragments of the long precision. Some experiments show that this ...

Exploiting Unlabeled Text with Different Unsupervised Segmentation Criteria for Chinese Word Segmentation

This paper presents a novel approach to improve Chinese word segmentation (CWS) that attempts to utilize unlabeled data such as training and test data without annotation for further enhancement of the state-of-the-art performance of supervised learning. The lexical information plays the role of information transformation from unlabeled text to supervised learning model. Four types of unsupervis...

Publication date: 2003
