Viewing sentence boundary detection as collocation identification

Authors

  • Tibor Kiss
  • Jan Strunk
  • Milly Schär
Abstract

The detection of abbreviations is an important step in the process of sentence boundary detection. We describe a flexible, language-independent and accurate method based on the idea that an abbreviation can be viewed as a collocation. As such, it can be identified by using methods for collocation detection such as the log likelihood ratio. Although the log likelihood ratio is known to show good recall, its precision is poor. We employ scaling factors that lead to a strong improvement in precision. Experiments with English and German corpora show that abbreviations can be detected with high accuracy. We also show that inaccurate tokenization leads to a considerably higher error rate during tagging.
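The core idea of the abstract, scoring a word and a following period as a collocation with the log likelihood ratio, can be illustrated with a minimal sketch. It assumes whitespace tokenization and the standard binomial form of the log likelihood ratio used in collocation detection; the function names (`llr`, `score_candidates`) and the toy corpus are hypothetical, and the scaling factors mentioned in the abstract are not reproduced because the abstract does not specify them.

```python
import math
from collections import Counter


def _log_binom(k, n, p):
    """Log of p**k * (1 - p)**(n - k), treating 0 * log(0) as 0."""
    total = 0.0
    if k > 0:
        total += k * math.log(p)
    if n - k > 0:
        total += (n - k) * math.log(1.0 - p)
    return total


def llr(c1, c2, c12, n):
    """Return -2 log lambda for a word/period pair.

    c1  = occurrences of the word (with or without a final period)
    c2  = tokens that end in a period
    c12 = occurrences of the word with a final period
    n   = total number of tokens
    """
    rest = n - c1
    p = c2 / n                                  # null: period is independent of the word
    p1 = c12 / c1                               # alternative: rate after this word
    p2 = (c2 - c12) / rest if rest else 0.0     # alternative: rate after other words
    log_lambda = (
        _log_binom(c12, c1, p) + _log_binom(c2 - c12, rest, p)
        - _log_binom(c12, c1, p1) - _log_binom(c2 - c12, rest, p2)
    )
    return -2.0 * log_lambda


def score_candidates(tokens):
    """Score every word type that occurs at least once with a final period."""
    n = len(tokens)
    word_counts = Counter(t.rstrip(".") for t in tokens)
    with_period = Counter(t.rstrip(".") for t in tokens if t.endswith("."))
    c2 = sum(with_period.values())
    return {
        w: llr(word_counts[w], c2, c12, n)
        for w, c12 in with_period.items()
        if w  # skip tokens that consist only of periods
    }


if __name__ == "__main__":
    # Toy corpus, invented for illustration only.
    text = ("Dr. Smith met Dr. Jones today. Dr. Brown left early. "
            "The meeting ended late. Dr. Lee stayed.")
    scores = score_candidates(text.split())
    for word, score in sorted(scores.items(), key=lambda kv: -kv[1]):
        print(f"{word}\t{score:.2f}")
```

On its own, this statistic is two-sided: it is also large when a word is followed by a period less often than expected, and it tends to rank frequent sentence-final words highly. That is consistent with the abstract's remark that the raw ratio has good recall but poor precision, which the authors address with additional scaling factors.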

Similar resources

Unsupervised Multilingual Sentence Boundary Detection

In this article, we present a language-independent, unsupervised approach to sentence boundary detection. It is based on the assumption that a large number of ambiguities in the determination of sentence boundaries can be eliminated once abbreviations have been identified. Instead of relying on orthographic clues, the proposed system is able to detect abbreviations with high accuracy using thre...

An Algorithm Combining Statistics-based and Rules-based for Chunk Identification of Chinese Sentences

Natural language processing (NLP) is a very hot research domain. One important branch of it is sentence analysis, including Chinese sentence analysis. However, currently, no mature deep analysis theories and techniques are available. An alternative way is to perform shallow parsing on sentences which is very popular in the domain. The chunk identification is a fundamental task for shallow parsi...

Parsing and MWE Detection: Fips at the PARSEME Shared Task

Identifying multiword expressions (MWEs) in a sentence in order to ensure their proper processing in subsequent applications, like machine translation, and performing the syntactic analysis of the sentence are interrelated processes. In our approach, priority is given to parsing alternatives involving collocations, and hence collocational information helps the parser through the maze of alterna...

Sentence Analysis and Collocation Identification

Identifying collocations in a sentence, in order to ensure their proper processing in subsequent applications, and performing the syntactic analysis of the sentence are interrelated processes. Syntactic information is crucial for detecting collocations, and vice versa, collocational information is useful for parsing. This article describes an original approach in which collocations are identifi...

Shape Identification Technique for a 2-d Elliptic System by Boundary Integral Equation Method

This paper is concerned with the identification of the geometrical structure of the boundary shape for a two-dimensional boundary value problem. The output least square identification method is considered for estimating partially unknown boundary shapes. A numerical parameter estimation technique using the spline collocation method is proposed. This research was supported by the National Aeron...

Publication date: 2002