Towards a Syntactic Account of Punctuation

Author

  • Bernard E. M. Jones
Abstract

Little notice has been taken of punctuation in the field of natural language processing, chiefly due to the lack of any coherent theory on which to base implementations. Some work has been carried out concerning punctuation and parsing, but much of it seems to have been rather ad-hoc and performance-motivated. This paper describes the first step towards the construction of a theoretically-motivated account of punctuation. Parsed corpora are processed to extract punctuation patterns, which are then checked and generalised to a small set of General Punctuation Rules. Their usage is discussed, and suggestions are made for possible methods of including punctuation information in grammars.

1 Introduction

Hitherto, the field of punctuation has been almost completely ignored within Natural Language Processing, with perhaps the single exception of the sentence-final full-stop (period). The reason for this non-treatment has been the lack of any coherent theory of punctuation on which a computational treatment could be based. As a result, most contemporary systems simply strip out punctuation in input text, and do not put any marks into generated texts. Intuitively, this seems very wrong, since punctuation is such an integral part of many written languages. If text in the real world (a newspaper, for example) were to appear without any punctuation marks, it would appear very stilted, ambiguous or infantile. Therefore it is likely that any computational system that ignores these extra textual cues will suffer a degradation in performance, or at the very least a great restriction in the class of linguistic data it is able to process.

Several studies have already shown the potential for using punctuation within NLP. Dale (1991) has shown the positive benefits of using punctuation in the fields of discourse structure and semantics, suggesting that it can be used to indicate degrees of rhetorical balance and aggregation between juxtaposed elements, and also that in certain cases a punctuation mark can determine the rhetorical relations that hold between two elements. In the field of syntax Jones (1994) has shown, through a comparison of the performance of a grammar that uses punctuation and one which does not, that for the more complex sentences of real language, parsing with a punctuated grammar yields around two orders of magnitude fewer parses than parsing with an unpunctuated grammar, and that additionally the punctuated parses better reflect the linguistic structure of the sentences. Briscoe and Carroll (1995) extend this work to show the …

Similar resources

An Information-Based Treatment of Punctuation

Punctuation marks have recently attracted attention within the linguistics community mostly from a syntactic perspective. In this paper, we aim to give a preliminary account of information-based aspects of punctuation marks, drawing our points from examples and links with related phenomena such as intonation. We give our initial treatment within the Discourse Representation Theory.

Toward a punctuation checker for Basque

Until some years ago, researchers in computational linguistics ignored punctuation. Nevertheless, since the publication of Nunberg’s monograph [Nunberg G., 1990], work on punctuation has increased [Bayraktar M. et al., 1998] [Hardt D., 2001] [Pala K. et al., 2003], and, recently, it is used more and more for different tasks of Natural Language Processing. Our research group has been working...

The Syntax and Semantics of Punctuation and Its Use in Interpretation

In this paper, I argue for a declarative description of the syntax and semantics of punctuation marks (in English) couched in a feature/unification-based phrase structure formalism, describe how Nunberg's (1990) syntactic analysis of punctuation can be combined with Dale's (1991) suggested semantic analysis within this framework, and present experimental evidence that 1) the resulting text gramm...

Improving parsing by incorporating prosodic clause boundaries into a grammar

In written language, punctuation is used to separate main and subordinate clauses. In spoken language, ambiguities arise due to missing punctuation, but clause boundaries are often marked prosodically and can be used instead. We detect PCBs (Prosodically marked Clause Boundaries) by using prosodic features (duration, intonation, energy, and pause information) with a neural network, achieving a re...

Towards Testing the Syntax of Punctuation

Little work has been done in NLP on the subject of punctuation, owing mainly to a lack of a good theory on which computational treatments could be based. This paper describes early work in progress to try to construct such a theory. Two approaches to finding the syntactic function of punctuation marks are discussed, and procedures are described by which the results from these approaches can be ...

Publication date: 1996