Estimating Comma Placement in Natural Language

Authors

  • Jason Ansel
  • Evan P. C. Jones
  • Marek Olszewski
Abstract

We study the feasibility of identifying comma locations using both n-gram models and stochastic context-free grammars (SCFGs). Specifically, our algorithms take an input sentence without commas and return the positions where commas should be inserted, along with probability or confidence estimates. With minor modifications, this approach generalizes to correcting comma placement, but we focus on the simpler comma insertion problem. Two widely used tools for processing natural language are n-gram models and SCFG parsers. N-grams provide a linear Markov model, while SCFG parsers build a rich hierarchical structure. In English, commas are typically used to separate phrases in a sentence. Hence, it would seem logical that SCFGs should perform better than the n-gram model, which examines sentences in n-gram-length chunks. However, n-grams can easily be trained on very large data sets, and thus can provide a rich source of statistical information. Since these language models are very different, this paper evaluates both.
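
The comma insertion task described in the abstract (comma-free sentence in, comma positions with probability estimates out) can be illustrated with a toy sketch. This is not the authors' n-gram model or their SCFG approach: it is a simple count-based estimate over adjacent word pairs with add-alpha smoothing, and every name in it (`tokenize`, `train`, `comma_probabilities`, `insert_commas`, `alpha`, `threshold`) as well as the three-sentence corpus is a hypothetical chosen for illustration.

```python
from collections import Counter


def tokenize(sentence):
    # Lowercase, then split on whitespace, treating each comma as its own token.
    return sentence.lower().replace(",", " , ").split()


def train(sentences):
    # For every adjacent word pair, count how often a comma did or did not
    # separate the two words in the training text.
    with_comma, without_comma = Counter(), Counter()
    for sentence in sentences:
        prev, saw_comma = None, False
        for tok in tokenize(sentence):
            if tok == ",":
                saw_comma = True
                continue
            if prev is not None:
                target = with_comma if saw_comma else without_comma
                target[(prev, tok)] += 1
            prev, saw_comma = tok, False
    return with_comma, without_comma


def comma_probabilities(words, model, alpha=1.0):
    # Smoothed estimate of P(comma between left and right | left, right).
    # An unseen pair gets probability 0.5 (assumption: symmetric smoothing).
    with_comma, without_comma = model
    probs = []
    for left, right in zip(words, words[1:]):
        c = with_comma[(left, right)]
        n = without_comma[(left, right)]
        probs.append((c + alpha) / (c + n + 2 * alpha))
    return probs


def insert_commas(sentence, model, threshold=0.5):
    # Strip any existing commas, then insert one after each word whose
    # following gap has an estimated probability strictly above the threshold.
    words = [t for t in tokenize(sentence) if t != ","]
    probs = comma_probabilities(words, model)
    out = []
    for i, word in enumerate(words):
        if i < len(probs) and probs[i] > threshold:
            word += ","
        out.append(word)
    return " ".join(out)


# Hypothetical toy corpus; a real n-gram model would be trained on far more text.
corpus = [
    "however, we focus on this",
    "however, it works",
    "we focus on it",
]
model = train(corpus)
print(insert_commas("however we focus on it", model))
```

Because the estimate only looks at the two words around each gap, it cannot capture the phrase structure that the abstract expects an SCFG parser to exploit; it is meant only to make the input and output of the task concrete.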

Similar resources

Automatic Comma Insertion for Japanese Text Generation

This paper proposes a method for automatically inserting commas into Japanese texts. In Japanese sentences, commas play an important role in explicitly separating the constituents, such as words and phrases, of a sentence. The method can be used as an elemental technology for natural language generation such as speech recognition and machine translation, or in writing-support tools for non-nati...

NoDE: A Benchmark of Natural Language Arguments

In recent years, natural models of argumentation and argument mining have become increasingly important topics in the argumentation community. Given this tendency, there is a need to produce standard datasets on which natural language approaches to argumentation can be evaluated. In this paper, we present NoDE, a benchmark of natural language arguments composed of three datasets, built ...

The white ‘comma’ as a distractive mark on the wings of comma butterflies

(DOI: http://dx.doi.org/10.1016/j.anbehav.2013.10.003) Distractive marks have been suggested to prevent predator detection or recognition of a prey, by drawing the attention away from recognizable traits of the bearer. The white ‘comma’ on the wings of comma butterflies, Polygonia c-album, has been suggested to represent such a distractive mark. In a lab...

Argument Mining Using Argumentation Scheme Structures

Argumentation schemes are patterns of human reasoning which have been detailed extensively in philosophy and psychology. In this paper we demonstrate that the structure of such schemes can provide rich information to the task of automatically identifying complex argumentative structures in natural language text. By training a range of classifiers to identify the individual proposition types which ...

Correcting Comma Errors in Learner Essays, and Restoring Commas in Newswire Text

While the field of grammatical error detection has progressed over the past few years, one area of particular difficulty for both native and non-native learners of English, comma placement, has been largely ignored. We present a system for comma error correction in English that achieves an average of 89% precision and 25% recall on two corpora of unedited student essays. This system also achiev...

Publication date: 2012