Parsing Linear Context-Free Rewriting Systems
Abstract
We describe four different parsing algorithms for Linear Context-Free Rewriting Systems (Vijay-Shanker et al., 1987). The algorithms are described as deduction systems, and possible optimizations are discussed.

The only parsing algorithms presented for linear context-free rewriting systems (LCFRS; Vijay-Shanker et al., 1987) and the equivalent formalism multiple context-free grammar (MCFG; Seki et al., 1991) are extensions of the CKY algorithm (Younger, 1967), designed more for their theoretical interest than for practical purposes. The reason for this could be that there are not many implementations of these grammar formalisms. However, since a very important subclass of the Grammatical Framework (Ranta, 2004) is equivalent to LCFRS/MCFG (Ljunglöf, 2004a; Ljunglöf, 2004b), there is a need for practical parsing algorithms. In this paper we describe four different parsing algorithms for Linear Context-Free Rewriting Systems. The algorithms are described as deduction systems, and possible optimizations are discussed.

1 Introductory definitions

A record is a structure $\Gamma = \{ r_1 = a_1; \ldots; r_n = a_n \}$, where all $r_i$ are distinct. Note that this can be seen as a set of feature-value pairs. This means that we can define a simple version of record unification $\Gamma_1 \sqcup \Gamma_2$ as the union $\Gamma_1 \cup \Gamma_2$, provided that there is no $r$ such that $\Gamma_1.r \neq \Gamma_2.r$.

We sometimes denote a sequence $X_1, \ldots, X_n$ by the more compact $\vec{X}$. To update the $i$th record in a list of records, we write $\vec{\Gamma}[i := \Gamma]$. To substitute a variable $B_k$ for a record $\Gamma_k$ in any data structure $\Gamma$, we write $\Gamma[B_k/\Gamma_k]$.

1.1 Decorated Context-Free Grammars

The context-free approximation described in section 4 uses a form of CFG with decorated rules of the form $f : A \to \alpha$, where $f$ is the name of the rule, and $\alpha$ is a sequence of terminals and categories subscripted with information needed for post-processing of the context-free parse result. In all other respects a decorated CFG can be seen as a straightforward CFG.
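The record operations defined in Section 1 can be illustrated with a minimal Python sketch. It assumes records are represented as plain dictionaries from labels to values; the function names `unify` and `update` are hypothetical and do not come from the paper, and the list index is 0-based here rather than 1-based.

```python
from typing import Optional


def unify(g1: dict, g2: dict) -> Optional[dict]:
    """Record unification: the union of g1 and g2.

    Returns None when the two records assign different values
    to the same label, i.e. when unification fails.
    """
    for label, value in g1.items():
        if label in g2 and g2[label] != value:
            return None
    return {**g1, **g2}


def update(records: list, i: int, new_record: dict) -> list:
    """Return a copy of the list with the i-th record replaced.

    Assumption: i is 0-based, unlike the 1-based index in the text.
    """
    return records[:i] + [new_record] + records[i + 1:]


# Two records that agree on the shared label "p" unify to their union.
merged = unify({"p": (0, 1)}, {"p": (0, 1), "q": (1, 2)})

# Two records that disagree on "p" do not unify.
failed = unify({"p": (0, 1)}, {"p": (0, 2)})
```

Returning `None` on failure is only one possible design; a parser might instead simply skip inference steps whose side condition fails.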
1.2 Linear Context-Free Rewriting Systems

A linear context-free rewriting system (LCFRS; Vijay-Shanker et al., 1987) is a linear, non-erasing multiple context-free grammar (MCFG; Seki et al., 1991). An MCFG rule is written (see footnote 1)

$$A \to f[B_1 \ldots B_\delta] := \{ r_1 = \alpha_1; \ldots; r_n = \alpha_n \}$$

where $A$ and $B_i$ are categories, $f$ is the name of the rule, $r_i$ are record labels and $\alpha_i$ are sequences of terminals and argument projections of the form $B_i.r$. The language $L(A)$ of a category $A$ is a set of string records, and is defined recursively as

$$L(A) = \{ \Phi[B_1/\Gamma_1, \ldots, B_\delta/\Gamma_\delta] \mid A \to f[B_1 \ldots B_\delta] := \Phi,\ \Gamma_1 \in L(B_1), \ldots, \Gamma_\delta \in L(B_\delta) \}$$

Footnote 1: We borrow the idea of equating argument categories and variables from Nakanishi et al. (1997), but instead of tuples we use the equivalent notion of records for the linearizations.

It is the possibility of discontinuous constituents that makes LCFRS/MCFG more expressive than context-free grammars. If the grammar only consists of single-label records, it generates a context-free language.

Example. A small example grammar is shown in figure 1. It generates the language

$$L(S) = \{ s\, s_{hm} \mid s \in (a \cup b)^* \}$$

where $s_{hm}$ is the image of $s$ under the homomorphic mapping that translates each a to c and each b to d. Examples of generated strings are ac, abcd and bbaddc. However, neither abc nor abcdabcd will be generated. The language is not context-free since it contains a combination of multiple and crossed agreement with duplication.

Figure 1: An example grammar describing the language $\{ s\, s_{hm} \mid s \in (a \cup b)^* \}$

$S \to f[A] := \{ s = A.p\ A.q \}$
$A \to g[A_1\ A_2] := \{ p = A_1.p\ A_2.p;\ q = A_1.q\ A_2.q \}$
$A \to ac[\,] := \{ p = a;\ q = c \}$
$A \to bd[\,] := \{ p = b;\ q = d \}$

If there is at most one occurrence of each possible projection $A_i.r$ in a linearization record, the MCFG rule is linear. If all rules are linear the grammar is linear. A rule is erasing if there are argument projections that have no realization in the linearization. A grammar is erasing if it contains an erasing rule.
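To make the recursive definition of L(A) and the notions of linear and erasing rules concrete, here is a small Python sketch of the example grammar from figure 1. The encoding is an assumption made for illustration: a rule is a tuple, a terminal is a string, and a projection is a pair of a 0-based argument position and a label. The helper names (`language`, `is_linear`, `is_erasing`) are hypothetical. Because the language is infinite, `language` only enumerates derivations up to a given tree height.

```python
from itertools import product

# Each rule: (category, name, argument categories, linearization record).
# In a linearization, a str is a terminal and a pair (k, label) is the
# projection "k-th argument . label" (k is 0-based; an assumed encoding).
RULES = [
    ("S", "f", ["A"], {"s": [(0, "p"), (0, "q")]}),
    ("A", "g", ["A", "A"], {"p": [(0, "p"), (1, "p")],
                            "q": [(0, "q"), (1, "q")]}),
    ("A", "ac", [], {"p": ["a"], "q": ["c"]}),
    ("A", "bd", [], {"p": ["b"], "q": ["d"]}),
]


def language(cat, depth):
    """String records derivable for `cat` by trees of height <= depth."""
    if depth == 0:
        return []
    results = []
    for lhs, _name, args, lin in RULES:
        if lhs != cat:
            continue
        arg_langs = [language(b, depth - 1) for b in args]
        # Substitute every combination of argument records into the rule.
        for choice in product(*arg_langs):
            record = {
                label: "".join(
                    tok if isinstance(tok, str) else choice[tok[0]][tok[1]]
                    for tok in toks
                )
                for label, toks in lin.items()
            }
            if record not in results:
                results.append(record)
    return results


def projections(rule):
    """All projection occurrences in the rule's linearization."""
    return [tok for toks in rule[3].values() for tok in toks
            if not isinstance(tok, str)]


def is_linear(rule):
    """Linear: no projection occurs more than once."""
    projs = projections(rule)
    return len(projs) == len(set(projs))


def labels_of(cat):
    """Record labels of a category, read off the rules in RULES."""
    return {label for lhs, _n, _a, lin in RULES if lhs == cat for label in lin}


def is_erasing(rule):
    """Erasing: some projection of an argument is never used."""
    expected = {(k, label) for k, b in enumerate(rule[2])
                for label in labels_of(b)}
    return not expected <= set(projections(rule))


# The rule E -> e[A] := { r1 = A.p; r2 = A.p } discussed in the text.
E_RULE = ("E", "e", ["A"], {"r1": [(0, "p")], "r2": [(0, "p")]})

strings = {record["s"] for record in language("S", 4)}
```

With height 4 the set `strings` contains ac, abcd and bbaddc but neither abc nor abcdabcd, and `E_RULE` is reported as both non-linear and erasing, in line with the text.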
It is possible to transform an erasing grammar to non-erasing form (Seki et al., 1991).

Example. The example grammar is both linear and non-erasing. However, given that grammar, the rule

$$E \to e[A] := \{ r_1 = A.p;\ r_2 = A.p \}$$

is both non-linear (since $A.p$ occurs more than once) and erasing (since it does not mention $A.q$).

1.3 Ranges

Given an input string $w$, a range $\rho$ is a pair of indices $(i, j)$, where $0 \leq i \leq j \leq |w|$ (Boullier, 2000). The entire string $w = w_1 \ldots w_n$ spans the range $(0, n)$. The word $w_i$ spans the range $(i - 1, i)$ and the substring $w_{i+1}, \ldots, w_j$ spans the range $(i, j)$. A range with identical indices, $(i, i)$, is called an empty range and spans the empty string.

A record containing label-range pairs, $\Gamma = \{ r_1 = \rho_1, \ldots, r_n = \rho_n \}$, is called a range record. Given a range $\rho = (i, j)$, the ceiling of $\rho$ returns an empty range for the right index, $\lceil \rho \rceil = (j, j)$; and the floor of $\rho$ does the same for the left index, $\lfloor \rho \rfloor = (i, i)$. Concatenation of two ranges is non-deterministic:

$$(i, j) \cdot (j', k) = \{ (i, k) \mid j = j' \}$$

1.3.1 Range restriction

In order to retrieve the ranges of any substring $s$ in a sentence $w = w_1 \ldots w_n$ we define range restriction of $s$ with respect to $w$ as

$$\langle s \rangle^w = \{ (i, j) \mid s = w_{i+1} \ldots w_j \}$$

i.e. the set of all occurrences of $s$ in $w$. If $w$ is understood from the context we simply write $\langle s \rangle$.

Range restriction of a linearization record $\Phi$ is written $\langle \Phi \rangle$, which is a set of records, where every terminal token $s$ is replaced by a range from $\langle s \rangle$. The range restriction of two terminals next to each other fails if range concatenation fails for the resulting ranges. Any unbound variables in $\Phi$ are unaffected by range restriction.

Example. Given the string $w = abba$, range restricting the terminal $a$ yields

$$\langle a \rangle = \{ (0, 1), (3, 4) \}$$

Furthermore,

$$\langle a\ A.r\ a\ b\ B.q \rangle = \{ (0, 1)\ A.r\ (0, 2)\ B.q,\ \ (3, 4)\ A.r\ (0, 2)\ B.q \}$$

The other possible solutions fail since they cannot be range concatenated.
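The range operations of Section 1.3 can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: ranges are pairs of integers, terminals are strings, unbound projections are instances of a hypothetical `Proj` class, and the function names are invented for the example. The final lines recompute the example for w = abba.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Proj:
    """An unbound argument projection such as A.r (hypothetical helper)."""
    arg: str
    label: str


def ceiling(rng):
    """Empty range at the right index of rng."""
    return (rng[1], rng[1])


def floor(rng):
    """Empty range at the left index of rng."""
    return (rng[0], rng[0])


def concat(r1, r2):
    """Range concatenation; the empty set signals failure."""
    (i, j), (j2, k) = r1, r2
    return {(i, k)} if j == j2 else set()


def restrict(s, w):
    """Range restriction of terminal s: all (i, j) with w[i:j] == s."""
    n = len(s)
    return {(i, i + n) for i in range(len(w) - n + 1) if w[i:i + n] == s}


def restrict_sequence(tokens, w):
    """Range restriction of a sequence of terminals and projections.

    Terminals become ranges, adjacent ranges are concatenated (dropping
    candidates where concatenation fails), and projections stay as-is.
    """
    results = [[]]
    for tok in tokens:
        extended = []
        for partial in results:
            if isinstance(tok, Proj):
                extended.append(partial + [tok])
                continue
            for rng in sorted(restrict(tok, w)):
                if partial and isinstance(partial[-1], tuple):
                    # Previous element is a range: the two must be adjacent.
                    for joined in concat(partial[-1], rng):
                        extended.append(partial[:-1] + [joined])
                else:
                    extended.append(partial + [rng])
        results = extended
    return results


w = "abba"
a_ranges = restrict("a", w)
example = restrict_sequence(
    ["a", Proj("A", "r"), "a", "b", Proj("B", "q")], w
)
```

Here `a_ranges` holds the two occurrences of a, and `example` holds the two surviving solutions from the text; every other candidate is dropped because the ranges for the adjacent terminals a and b cannot be concatenated.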
2 Parsing as deduction

The idea of parsing as deduction (Shieber et al., 1995) is to deduce parse items by inference rules. A parse item is a representation of a piece of information that the parsing algorithm has acquired. An inference rule is written
Publication date: 2005