Two Knives Cut Better Than One: Chinese Word Segmentation with Dual Decomposition
Authors
Abstract
There are two dominant approaches to Chinese word segmentation: word-based and character-based models, each with respective strengths. Prior work has shown that gains in segmentation performance can be achieved from combining these two types of models; however, past efforts have not provided a practical technique to allow mainstream adoption. We propose a method that effectively combines the strength of both segmentation schemes using an efficient dual-decomposition algorithm for joint inference. Our method is simple and easy to implement. Experiments on SIGHAN 2003 and 2005 evaluation datasets show that our method achieves the best reported results to date on 6 out of 7 datasets.
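The joint inference idea in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes two toy chain models that each score a binary label per character (1 = the character starts a new word, 0 = it continues the current word), decodes each with Viterbi, and uses Lagrange multipliers to push the two outputs toward agreement. The function names, the toy scores, and the `step / t` step-size schedule are all illustrative assumptions.

```python
from typing import List, Sequence, Tuple

Scores = Sequence[Sequence[float]]


def viterbi(unary: Scores, trans: Scores) -> List[int]:
    """Best binary label sequence under unary[i][label] + trans[prev][label]."""
    score = [unary[0][0], unary[0][1]]
    back = []
    for i in range(1, len(unary)):
        new_score, ptr = [], []
        for cur in (0, 1):
            prev = max((0, 1), key=lambda p: score[p] + trans[p][cur])
            new_score.append(score[prev] + trans[prev][cur] + unary[i][cur])
            ptr.append(prev)
        score = new_score
        back.append(ptr)
    label = max((0, 1), key=lambda l: score[l])
    path = [label]
    for ptr in reversed(back):  # follow back-pointers to the first position
        label = ptr[label]
        path.append(label)
    return path[::-1]


def dual_decompose(unary_a: Scores, trans_a: Scores,
                   unary_b: Scores, trans_b: Scores,
                   max_iter: int = 100, step: float = 1.0
                   ) -> Tuple[List[int], bool]:
    """Subgradient dual decomposition over two toy models (hypothetical setup)."""
    u = [0.0] * len(unary_a)  # one multiplier per character position
    y: List[int] = []
    for t in range(1, max_iter + 1):
        # Model A sees +u on label 1, model B sees -u on label 1.
        y = viterbi([[s0, s1 + u[i]] for i, (s0, s1) in enumerate(unary_a)], trans_a)
        z = viterbi([[s0, s1 - u[i]] for i, (s0, s1) in enumerate(unary_b)], trans_b)
        if y == z:
            return y, True  # agreement certifies a jointly optimal labeling
        rate = step / t  # decaying step size (an assumed schedule)
        u = [ui - rate * (yi - zi) for ui, yi, zi in zip(u, y, z)]
    return y, False  # no agreement within max_iter; fall back to model A


def to_words(chars: str, labels: Sequence[int]) -> List[str]:
    """Turn per-character start/continue labels into a list of words."""
    words: List[str] = []
    for ch, lab in zip(chars, labels):
        if lab == 1 or not words:
            words.append(ch)
        else:
            words[-1] += ch
    return words


# Toy scores: the models disagree only on whether the third character starts a word.
unary_a = [[0.0, 2.0], [1.0, 0.0], [0.0, 0.5]]
trans_a = [[0.0, 0.0], [0.0, -0.5]]
unary_b = [[0.0, 2.0], [1.0, 0.0], [2.0, 0.0]]
trans_b = [[0.0, 0.0], [0.0, 0.0]]

labels, converged = dual_decompose(unary_a, trans_a, unary_b, trans_b)
print(labels, converged)  # [1, 0, 0] True
```

In this toy run the two decoders disagree on the first pass, the multiplier on the third position moves by one step, and they agree on the second pass. When the loop ends without agreement, the sketch simply returns model A's last output, which is one possible fallback among several.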
Similar resources
Joint segmentation and named entity recognition using dual decomposition in Chinese discharge summaries.
OBJECTIVE: In this paper, we focus on three aspects: (1) annotating a standard corpus of Chinese discharge summaries; (2) performing word segmentation and named entity recognition on that corpus; (3) building a joint model that performs word segmentation and named entity recognition. DESIGN: Two independent systems of word segmentation and named entity recognition were built based ...
Bidirectional Sequence Labeling via Dual Decomposition
In this paper, we propose a bidirectional algorithm for sequence labeling to capture the influence of both the left-to-right and the right-to-left directions. We combine the optimization of two unidirectional models from opposite directions via the dual decomposition method to jointly label the input sequence. Experiments on three sequence labeling tasks (Chinese word segmentation, English POS ...
A Subword Normalized Cut Approach to Automatic Story Segmentation of Chinese Broadcast News
This paper presents a subword normalized cut (N-cut) approach to automatic story segmentation of Chinese broadcast news (BN). We represent a speech recognition transcript using a weighted undirected graph, where the nodes correspond to sentences and the weights of edges describe inter-sentence similarities. Story segmentation is formalized as a graph-partitioning problem under the N-cut criteri...
Can MDL Improve Unsupervised Chinese Word Segmentation?
It is often assumed that Minimum Description Length (MDL) is a good criterion for unsupervised word segmentation. In this paper, we introduce a new approach to unsupervised word segmentation of Mandarin Chinese that leads to segmentations whose Description Length is lower than what can be obtained using other algorithms previously proposed in the literature. Surprisingly, we show that this lower...
Enhancing Statistical Machine Translation with Character Alignment
The dominant practice in statistical machine translation (SMT) uses the same Chinese word segmentation specification in both the alignment and translation rule induction steps when building a Chinese-English SMT system. This may be suboptimal, because the word segmentation that is better for alignment is not necessarily better for translation. To tackle this, we propose a framework that uses two di...
Publication date: 2014