Unsupervised Training for Overlapping Ambiguity Resolution in Chinese Word Segmentation
Abstract
This paper proposes an unsupervised training approach to resolving overlapping ambiguities in Chinese word segmentation. We present an ensemble of adapted Naïve Bayesian classifiers that can be trained using an unlabelled Chinese text corpus. These classifiers differ in that they use context words within windows of different sizes as features. The performance of our approach is evaluated on a manually annotated test set. Experimental results show that the proposed approach achieves an accuracy of 94.3%, rivaling the rule-based and supervised training methods.
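The ensemble idea in the abstract can be illustrated with a minimal sketch: several Naive Bayes classifiers, each using context words within a different window size, decide between the two readings of an overlapping ambiguity string ABC. The following are assumptions for illustration only and are not taken from the paper: the class labels `AB|C` and `A|BC`, add-one smoothing, majority voting as the combination rule, the names `WindowNB` and `ensemble_predict`, and the placeholder tokens. The abstract does not say how the classifiers are adapted or combined, nor how training instances are derived from unlabelled text, so the toy examples below are labelled by hand.

```python
import math
from collections import Counter, defaultdict


class WindowNB:
    """Naive Bayes over context words within `window` tokens of the ambiguous string."""

    def __init__(self, window, alpha=1.0):
        self.window = window              # context words taken from each side
        self.alpha = alpha                # additive smoothing constant (assumed)
        self.class_counts = Counter()
        self.feat_counts = defaultdict(Counter)
        self.vocab = set()

    def _features(self, left, right):
        # Last `window` words to the left, first `window` words to the right.
        return left[-self.window:] + right[:self.window]

    def fit(self, examples):
        for left, right, label in examples:
            self.class_counts[label] += 1
            for w in self._features(left, right):
                self.feat_counts[label][w] += 1
                self.vocab.add(w)
        return self

    def log_scores(self, left, right):
        total = sum(self.class_counts.values())
        v = len(self.vocab) + 1           # one extra slot for unseen words
        scores = {}
        for label, n in self.class_counts.items():
            denom = sum(self.feat_counts[label].values()) + self.alpha * v
            s = math.log(n / total)       # log prior
            for w in self._features(left, right):
                s += math.log((self.feat_counts[label][w] + self.alpha) / denom)
            scores[label] = s
        return scores


def ensemble_predict(classifiers, left, right):
    """Majority vote over the classifiers (an assumed combination rule)."""
    votes = Counter()
    for clf in classifiers:
        scores = clf.log_scores(left, right)
        votes[max(scores, key=scores.get)] += 1
    return votes.most_common(1)[0][0]


# Hand-labelled toy instances: (left context, right context, segmentation of ABC).
train = [
    (["x", "p"], ["q", "y"], "AB|C"),
    (["z", "p"], ["q", "x"], "AB|C"),
    (["x", "r"], ["s", "y"], "A|BC"),
    (["z", "r"], ["s", "x"], "A|BC"),
]

# Three classifiers that differ only in window size.
classifiers = [WindowNB(window=k).fit(train) for k in (1, 2, 3)]

print(ensemble_predict(classifiers, ["y", "p"], ["q", "z"]))  # prints AB|C
```

An odd number of classifiers avoids tied votes in the two-class case; summing the log scores across windows would be another plausible way to combine them.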
Similar resources
Ambiguity Resolution in Chinese Word Segmentation
A new method for Chinese word segmentation named Conditional F&BMM (Forward and Backward Maximal Matching) which incorporates both bigram statistics (i.e., mutual information and difference of t-test between Chinese characters) and linguistic rules for ambiguity resolution is proposed in this paper. The key characteristics of this model are the use of: (i) statistics which can be automatically ...
Statistical Properties of Overlapping Ambiguities in Chinese Word Segmentation and a Strategy for Their Disambiguation
Overlapping ambiguity is a major ambiguity type in Chinese word segmentation. In this paper, the statistical properties of overlapping ambiguities are intensively studied based on the observations from a very large balanced general-purpose Chinese corpus. The relevant statistics are given from different perspectives. The stability of high frequent maximal overlapping ambiguities is tested based...
Covering Ambiguity Resolution in Chinese Word Segmentation Based on Contextual Information
Covering ambiguity is one of the two basic types of ambiguities in Chinese word segmentation. We regard its resolution as equivalent to word sense disambiguation, and make use of the classical vector space model in information retrieval to formulate the contexts of ambiguous words. A variation form of TFIDF weighting is proposed and a Chinese thesaurus is additionally utilized to cope with data...
An Improvement Method for The Ambiguous Fragments Discovery in Chinese Word Segmentation
Disambiguation is a difficult task in Chinese Automatic Word Segmentation, and the ambiguous fragments discovery is the foundation of the disambiguation. This article proposes a method named Bidirectional Maximum Matching and Retroversion Multiword to discover the ambiguous fragments, which can deal with the overlapping ambiguity fragments of the long precision. Some experiments show that this ...
Exploiting Unlabeled Text with Different Unsupervised Segmentation Criteria for Chinese Word Segmentation
This paper presents a novel approach to improve Chinese word segmentation (CWS) that attempts to utilize unlabeled data such as training and test data without annotation for further enhancement of the state-of-the-art performance of supervised learning. The lexical information plays the role of information transformation from unlabeled text to supervised learning model. Four types of unsupervis...
Publication date: 2003