Automatic Grammar Induction and Parsing Free Text: A Transformation-Based Approach


Abstract

1. INTRODUCTION

There has been a great deal of interest of late in the automatic induction of natural language grammar. Given the difficulty inherent in manually building a robust parser, along with the availability of large amounts of training material, automatic grammar induction seems like a path worth pursuing. A number of systems have been built which can be trained automatically to bracket text into syntactic constituents. In [10], mutual information statistics are extracted from a corpus of text and this information is then used to parse new text. [13] defines a function to score the quality of parse trees, and then uses simulated annealing to heuristically explore the entire space of possible parses for a given sentence. In [3], distributional analysis techniques are applied to a large corpus to learn a context-free grammar.

The most promising results to date have been based on the inside-outside algorithm (i-o algorithm), which can be used to train stochastic context-free grammars. The i-o algorithm is an extension of the finite-state based Hidden Markov Model (by [1]), which has been applied successfully in many areas, including speech recognition and part of speech tagging. A number of recent papers have explored the potential of using the i-o algorithm to automatically learn a grammar [9, 15, 12, 6, 7, 14].

Below, we describe a new technique for grammar induction.² The algorithm works by beginning in a very naive state of knowledge about phrase structure. By repeatedly comparing the results of parsing in the current state to the proper phrase structure for each sentence in the training corpus, the system learns a set of ordered transformations which can be applied to reduce parsing error. We believe this technique has advantages over other methods of phrase structure induction. Some of the advantages include: the system is very simple, it requires only a very small set of transformations, learning proceeds quickly and achieves a high degree of accuracy, and only a very small training corpus is necessary. In addition, since some tokens in a sentence are not even considered in parsing, the method could prove to be considerably more resistant to noise than a CFG-based approach. After describing the algorithm, we present results and compare these results to other recent results in automatic phrase structure induction.

* The author would like to thank Mark Liberman, Meiting Lu, David Magerman, Mitch Marcus, Rich Pito, Giorgio Satta, Yves Schabes and Tom Veatch. This work was supported by DARPA and AFOSR jointly under grant No. AFOSR-90-0066, and by ARO grant No. DAAL 03-89-C0031 PRI.
¹ Not in the traditional sense of the term.
² A similar method has been applied effectively in part of speech tagging; see [5, 4].

2. THE ALGORITHM

The learning algorithm is trained on a small corpus of partially bracketed text which is also annotated with part of speech information. All of the experiments presented below were done using the Penn Treebank annotated corpus [11]. The learner begins in a naive initial state, knowing very little about the phrase structure of the target corpus. In particular, all that is initially known is that English tends to be right branching and that final punctuation is final punctuation. Transformations are then learned automatically which transform the output of the naive parser into output which better resembles the phrase structure found in the training corpus. Once a set of transformations has been learned, the system is capable of taking sentences tagged with parts of speech and returning a binary-branching structure with nonterminals unlabelled.³

³ This is the same output given by systems described in [10, 3, 12, 14].

2.1. The Initial State Of The Parser

Initially, the parser operates by assigning a right-linear structure to all sentences. The only exception is that final punctuation is attached high. So, the sentence "The dog and old cat ate ." would be incorrectly bracketed as:

( ( The ( dog ( and ( old ( cat ate ) ) ) ) ) . )
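The initial state described in Section 2.1 can be sketched in a few lines of Python. This is an illustrative sketch only, not the author's implementation: the function names (`right_linear`, `initial_parse`, `show`) and the set `FINAL_PUNCTUATION` are hypothetical, and the sketch brackets the words directly for readability, although the system as described takes sentences tagged with parts of speech.

```python
from typing import List, Tuple, Union

# A tree is either a token (leaf) or a binary node (left, right).
Tree = Union[str, Tuple["Tree", "Tree"]]

# Assumption: which tokens count as "final punctuation" is not
# specified in the text; this set is chosen for illustration only.
FINAL_PUNCTUATION = {".", "?", "!"}


def right_linear(tokens: List[str]) -> Tree:
    """Nest tokens to the right: [a, b, c] -> (a, (b, c))."""
    if not tokens:
        raise ValueError("cannot bracket an empty token list")
    tree: Tree = tokens[-1]
    # Walk leftwards, wrapping the tree built so far as a right child.
    for token in reversed(tokens[:-1]):
        tree = (token, tree)
    return tree


def initial_parse(tokens: List[str]) -> Tree:
    """Right-linear structure, with final punctuation attached high."""
    if len(tokens) > 1 and tokens[-1] in FINAL_PUNCTUATION:
        return (right_linear(tokens[:-1]), tokens[-1])
    return right_linear(tokens)


def show(tree: Tree) -> str:
    """Render a tree in the space-separated bracket notation used above."""
    if isinstance(tree, str):
        return tree
    left, right = tree
    return f"( {show(left)} {show(right)} )"


if __name__ == "__main__":
    print(show(initial_parse("The dog and old cat ate .".split())))
```

Run on the example sentence, the sketch reproduces the naive bracketing shown above. Learned transformations would then be applied to such a tree to bring it closer to the bracketing in the training corpus.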


Similar resources

Transformation Based Error Driven Parsing

In this paper we describe a new technique for parsing free text: a transformational grammar is automatically learned that is capable of accurately parsing text into binary-branching syntactic trees. The algorithm works by beginning in a very naive state of knowledge about phrase structure. By repeatedly comparing the results of bracketing in the current state to proper bracketing provided in the...


Tiny Corpus Applications with Transformation-Based Error-Driven Learning : Evaluations of Automatic Grammar Induction and Partial Parsing of SaiSiyat

This paper reports a preliminary result on automatic grammar induction based on the framework of Brill and Markus (1992) and binary-branching syntactic parsing of Esperanto and SaiSiyat (a Formosan language). Automatic grammar induction requires a large corpus and is found implausible for processing endangered minor languages. Syntactic parsing, on the contrary, needs merely a tiny corpus and works alo...


Semi-automatic acquisition of domain-specific semantic structures

This paper describes a methodology for semi-automatic grammar induction from unannotated corpora belonging to a restricted domain. The grammar contains both semantic and syntactic structures, which are conducive towards language understanding. Our work aims to ameliorate the reliance of grammar development on expert handcrafting or the availability of annotated corpora. To strive for a reasonab...


Joint learning of ontology and semantic parser from text

Semantic parsing methods are used for capturing and representing semantic meaning of text. Meaning representation capturing all the concepts in the text may not always be available or may not be sufficiently complete. Ontologies provide a structured and reasoning-capable way to model the content of a collection of texts. In this work, we present a novel approach to joint learning of ontology an...
