Integrating Punctuation Rules and Naïve Bayesian Model for Chinese Creation Title Recognition

Authors

  • Conrad Chen
  • Hsin-Hsi Chen
Abstract

Creation titles, i.e. titles of literary and/or artistic works, comprise over 7% of named entities in Chinese documents. They are the fourth largest category of named entities in Chinese, after personal names, location names, and organization names. However, they have rarely been mentioned or studied before. Chinese title recognition is challenging for the following reasons. Titles have few internal features, and there are nearly no restrictions on their naming style. Their lengths and structures vary. Worst of all, they are generally composed of common words, so they look like ordinary sentence fragments. In this paper, we integrate punctuation rules, a lexicon, and naïve Bayesian models to recognize creation titles in Chinese documents. This pioneering study achieves a precision of 0.510 and a recall of 0.685. The promising results can be integrated into Chinese segmentation, used to retrieve relevant information for specific titles, and so on.
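
The abstract names three components (punctuation rules, a lexicon, and naïve Bayesian models) without giving their details. The following is only a minimal sketch of how such a combination might look, not the authors' method. Everything in it is an assumption made for illustration: the bracket pairs treated as title markers, the character-level features, the add-one smoothing, the rule that a lexicon hit is accepted outright, and the hypothetical names `extract_candidates`, `NaiveBayes`, and `recognize_titles`.

```python
import math
import re
from collections import Counter

# Assumption: strings enclosed in these bracket pairs are title candidates.
BRACKET_PATTERN = re.compile(r"《([^《》]+)》|「([^「」]+)」")


def extract_candidates(text):
    """Punctuation rule: return the strings enclosed in the assumed brackets."""
    return [m.group(1) or m.group(2) for m in BRACKET_PATTERN.finditer(text)]


class NaiveBayes:
    """Character-level multinomial naive Bayes with add-one smoothing."""

    def __init__(self):
        self.char_counts = {True: Counter(), False: Counter()}
        self.doc_counts = Counter()
        self.vocab = set()

    def train(self, examples):
        """examples: iterable of (string, is_title) pairs; both labels must occur."""
        for text, is_title in examples:
            self.doc_counts[is_title] += 1
            self.char_counts[is_title].update(text)
            self.vocab.update(text)

    def log_score(self, text, label):
        """Log prior plus summed log likelihoods of the characters under label."""
        total_docs = sum(self.doc_counts.values())
        score = math.log(self.doc_counts[label] / total_docs)
        # The extra +1 reserves probability mass for characters never seen in training.
        denom = sum(self.char_counts[label].values()) + len(self.vocab) + 1
        for ch in text:
            score += math.log((self.char_counts[label][ch] + 1) / denom)
        return score

    def is_title(self, text):
        return self.log_score(text, True) > self.log_score(text, False)


def recognize_titles(text, lexicon, model):
    """Accept a candidate if the lexicon lists it or the model prefers 'title'."""
    return [
        cand
        for cand in extract_candidates(text)
        if cand in lexicon or model.is_title(cand)
    ]
```

To try the sketch, train `NaiveBayes` on a handful of labeled strings, pass a set of known titles as `lexicon`, and call `recognize_titles` on a sentence. A realistic system would also need to handle titles that appear without any enclosing punctuation, which this sketch does not attempt.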

Similar resources

Title generation for spoken broadcast news using a training corpus

The problem of title generation involves finding the essence of a document and expressing it in only a few words. The results of a query to the Informedia Digital Video Library are summarized through an automatically generated title for each retrieved news story. When the document is errorful, as with speech-recognized broadcast news stories, the title creation challenge becomes even greater. W...

A Hierarchical Parsing Approach with Punctuation Processing for Long Chinese Sentences

(National Laboratory of Pattern Recognition, Institute of Automation, Chinese Academy of Sciences, Beijing 100080, China) Abstract: Based on an analysis of the usage and syntactic function of Chinese punctuation, this paper proposes a new hierarchical approach to parsing long Chinese sentences. In traditional parsing approaches, the parsing procedure is performed on one level and the ...

Studying of Classifying Chinese Sms Messages Based on Bayesian Classification

Although there is a lot of research on e-mail spam filters, only a little of it focuses on SMS (Short Message Service) systems, especially in Chinese. In this paper, we propose a two-layer filter model based on a Naïve Bayes classifier that uses both traditional filter rules and content filtering techniques. The experimental results illustrate that the two-layer filter model can enhance...

Improving Text Classification Quality Using a Two-Level Classifier Committee

Nowadays, automated text classification has gained special importance due to the increasing availability of documents in digital form and the ensuing need to organize them. Although this problem belongs to the Information Retrieval (IR) field, the dominant approach is based on machine learning techniques. Approaches based on classifier committees have shown better performance than the others. I...

Transcript mapping for handwritten Chinese documents by integrating character recognition model and geometric context

Creating document image datasets with ground-truths of regions, text lines and characters is a prerequisite for document analysis research. However, ground-truthing large datasets is not only laborious and time consuming but also prone to errors due to the difficulty of character segmentation and the large variability of character shape, size and position. This paper describes an effective reco...

Publication date: 2005