Structure Alignment Using Bilingual Chunking
Authors
Abstract
A new statistical method called “bilingual chunking” is proposed for structure alignment. Unlike existing approaches, which align hierarchical structures such as sub-trees, our method conducts alignment on chunks. The alignment is carried out by a simultaneous bilingual chunking algorithm. Using the constraints of chunk correspondence between the source language (SL) [1] and the target language (TL), our algorithm can dramatically reduce the search space, support a time-synchronous DP algorithm, and lead to highly consistent chunking. Furthermore, by unifying POS tagging and chunking in the search process, our algorithm effectively alleviates the influence of POS tagging deficiencies on the chunking result. The experimental results on English-Chinese structure alignment show that our model achieves 90% precision for chunking and 87% precision for chunk alignment.

* This work was done while the author was visiting Microsoft Research Asia.
[1] In this paper, we take English-Chinese parallel text as an example; the method is, however, relatively easy to extend to other language pairs.

Introduction

We address here the problem of structure alignment, which accepts a sentence pair as input and produces as output the parsed structures of both sides with the correspondences between them. Structure alignment can be used to support machine translation and cross-language information retrieval by providing an extended phrase translation lexicon and translation templates.

The popular methods for structure alignment try to align hierarchical structures such as sub-trees using parsing technology. However, the alignment accuracy cannot be guaranteed, since no parser can handle all authentic sentences well. Furthermore, the strategies usually used for structure alignment suffer from serious shortcomings.
For instance, parse-to-parse matching, which regards parsing and alignment as separate and successive procedures, suffers from the inconsistency between the grammars of different languages. Bilingual parsing, which treats parsing and alignment as a simultaneous procedure, needs an extra ‘bilingual grammar’. It is, however, difficult to write a complex ‘bilingual grammar’.

In this paper, a new statistical method called “bilingual chunking” is proposed for structure alignment. Unlike existing approaches, which align hierarchical structures such as sub-trees, our method conducts alignment on chunks. The alignment is carried out by a simultaneous bilingual chunking algorithm. Using the constraints of chunk correspondence between the source language (SL) and the target language (TL), our algorithm can dramatically reduce the search space, support a time-synchronous DP algorithm, and lead to highly consistent chunking. Furthermore, by unifying POS tagging and chunking in the search process, our algorithm effectively alleviates the influence of POS tagging deficiencies on the chunking result. The experimental results on English-Chinese structure alignment show that our model achieves 90% precision for chunking and 87% precision for chunk alignment.

1 Related Works

Most previous work conducts structure alignment with complex, hierarchical structures, such as phrase structures (e.g., Kaji, Kida & Morimoto, 1992) or dependency structures (e.g., Matsumoto et al., 1993; Grishman, 1994; Meyers, Yanharber & Grishman, 1996; Watanabe, Kurohashi & Aramaki, 2000). However, mismatches between complex structures across languages and the poor accuracy of the parser prevent structure alignment algorithms from producing highly accurate results.

A straightforward strategy for structure alignment is parse-to-parse matching, which regards parsing and alignment as two separate and successive procedures. First, parsing is conducted on each language separately.
Then the corresponding structures in the different languages are aligned (e.g., Kaji, Kida & Morimoto, 1992; Matsumoto et al., 1993; Grishman, 1994; Meyers, Yanharber & Grishman, 1996; Watanabe, Kurohashi & Aramaki, 2000). Unfortunately, automatic parse-to-parse matching has some weaknesses, as described in Wu (2000). For example, grammar inconsistency exists across languages, and it is hard to handle multiple alignment choices.

To deal with the difficulties in parse-to-parse matching, Wu (1997) utilizes inversion transduction grammar (ITG) for bilingual parsing. The bilingual parsing approach treats parsing and alignment as a single procedure that simultaneously encodes both the parsing and the transfer information. It is, however, difficult to write a broad-coverage ‘bilingual grammar’ for bilingual parsing.

2 Structure Alignment Using Bilingual Chunking

2.1 Principle

The chunks that we use are extracted from the Treebank. When converting a tree to a chunk sequence, the chunk types are based on the syntactic category part of the bracket label. Roughly, a chunk contains everything to the left of and including the syntactic head of the constituent of the same name. Besides the head, a chunk also contains pre-modifiers, but no post-modifiers or arguments (Erik, 2000). Using the chunk as the alignment structure, we can get around problems such as PP attachment and structure mismatches across languages, and can therefore achieve high chunking accuracy. Using bilingual chunking, we can achieve both high chunking accuracy and high chunk alignment accuracy by making the SL chunking process and the TL chunking process constrain and improve each other.

Our ‘bilingual chunking’ model for structure alignment comprises three integrated components: the chunking models of the two languages and the crossing constraint; it uses the chunk as the alignment structure (see Fig. 1). The crossing constraint requires that a chunk in one language correspond to at most one chunk in the other language.
For instance, in Fig. 2 (the dashed lines represent the word alignments; the brackets indicate the chunk boundaries), the phrase “the first man” is a monolingual chunk; it should, however, be divided into “the first” and “man” to satisfy the crossing constraint.

[Fig. 1: Three components of our model: the Source Language Chunking Model (integrated with POS tagging), the Target Language Chunking Model (integrated with POS tagging), and the Crossing Constraint]

[the first] [man] [who] [would fly across] [the channel]
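The crossing constraint described above can be illustrated with a minimal Python sketch. This is not the paper's algorithm: it only checks, for a fixed pair of chunkings and a set of word alignments, whether any chunk is linked to more than one chunk on the other side. The function names, the half-open span representation, the five placeholder target tokens, and the word alignments are all assumptions made for illustration; the target side of Fig. 2 is not available in this excerpt.

```python
from typing import Dict, List, Set, Tuple

Span = Tuple[int, int]  # half-open token span [start, end); an assumed representation


def chunk_of(index: int, chunks: List[Span]) -> int:
    """Return the position of the chunk covering a token index, or -1 if none."""
    for k, (start, end) in enumerate(chunks):
        if start <= index < end:
            return k
    return -1


def satisfies_crossing_constraint(
    src_chunks: List[Span],
    tgt_chunks: List[Span],
    alignments: List[Tuple[int, int]],
) -> bool:
    """True if every chunk is linked to at most one chunk in the other language."""
    src_to_tgt: Dict[int, Set[int]] = {}
    tgt_to_src: Dict[int, Set[int]] = {}
    for i, j in alignments:
        s = chunk_of(i, src_chunks)
        t = chunk_of(j, tgt_chunks)
        if s < 0 or t < 0:
            continue  # words outside any chunk are ignored in this sketch
        src_to_tgt.setdefault(s, set()).add(t)
        tgt_to_src.setdefault(t, set()).add(s)
    links = list(src_to_tgt.values()) + list(tgt_to_src.values())
    return all(len(linked) <= 1 for linked in links)


# English side of the example sentence (token indices 0..8).
src_tokens = ["the", "first", "man", "who", "would", "fly", "across", "the", "channel"]

# Hypothetical target side: five placeholder tokens, each its own chunk.
tgt_chunks = [(0, 1), (1, 2), (2, 3), (3, 4), (4, 5)]

# Hypothetical word alignments (source index, target index):
# first -> t0, man -> t4, fly -> t1, across -> t1, channel -> t2.
alignments = [(1, 0), (2, 4), (5, 1), (6, 1), (8, 2)]

# Monolingual chunking keeps "the first man" as one chunk.
monolingual_src = [(0, 3), (3, 4), (4, 7), (7, 9)]

# Bilingual chunking splits it into "the first" and "man".
bilingual_src = [(0, 2), (2, 3), (3, 4), (4, 7), (7, 9)]

print(satisfies_crossing_constraint(monolingual_src, tgt_chunks, alignments))
print(satisfies_crossing_constraint(bilingual_src, tgt_chunks, alignments))
```

Under these assumed alignments, the chunk “the first man” is linked to two different target chunks, so the monolingual chunking fails the check, while the split chunking passes. In the paper's model the constraint is applied during the search rather than as a check afterwards.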
Similar resources
Alignment-Guided Chunking
We introduce an adaptable monolingual chunking approach, Alignment-Guided Chunking (AGC), which makes use of knowledge of word alignments acquired from bilingual corpora. Our approach is motivated by the observation that a sentence should be chunked differently depending on the foreseen end-tasks. For example, given the different requirements of translation into (say) French and German, it is inappro...
NP Alignment in Bilingual Corpora
We created a simple gold standard for English-Hungarian NP-level alignment on Orwell’s 1984 (since this already exists in manually verified POS-tagged format in many languages thanks to the Multex and MultexEast project) by manually verifying the automatically generated NP chunking (we used the yamcha, mallet and hunchunk taggers) and manually aligning the maximal NPs and PPs. The maximum NP chun...
(65) Prior Publication Data Yamamoto et al., “Acquisition of Phrase-level Bilingual Correspondence Using Dependency Structure” in Proceedings of Coling Us a Method Includes Detecting a Syntactic Chunk in a Source
(*) Notice: Subject to any disclaimer, the term of this patent is extended or adjusted under 35 U.S.C. 154(b) by 939 days. (21) Appl. No. 10/403,862. Kenji Imamura, “Hierarchical Phrase Alignment Harmonized with Parsing”, in Proceedings of NLPRS 2001, Tokyo. Ferran Pla, Antonio Molina and Natividad Prieto, “Tagging and Chunking with bigrams”, ACL Coling 2000, vol. 2, 18th International...
Word Alignment for Languages with Scarce Resources Using Bilingual Corpora of Other Language Pairs
This paper proposes an approach to improve word alignment for languages with scarce resources using bilingual corpora of other language pairs. To perform word alignment between languages L1 and L2, we introduce a third language L3. Although only small amounts of bilingual data are available for the desired language pair L1-L2, large-scale bilingual corpora in L1-L3 and L2-L3 are available. Base...
Using Similarity Scoring To Improve the Bilingual Dictionary for Word Alignment
We describe an approach to improve the bilingual cooccurrence dictionary that is used for word alignment, and evaluate the improved dictionary using a version of the Competitive Linking algorithm. We demonstrate a problem faced by the Competitive Linking algorithm and present an approach to ameliorate it. In particular, we rebuild the bilingual dictionary by clustering similar words in a langua...
Journal title:
Volume, issue:
Pages: -
Publication date: 2002