MEDITE: A Unilingual Textual Aligner
نویسندگان
چکیده
This paper addresses a problem of natural language text alignment, from a humanities discipline called textual genetic criticism where different text versions must be compared. The paper shows that this task is hard because such versions can be very different and texts with a lot of internal repetitions present specific difficulties. MEDITE is a natural language text aligner that compares texts written in the same language. It detects modifications at character level, as opposed to related applications which either remain at word level or give poor results at character level. The detection of moved blocks in the text, induced by our formalism based on edit distance with moves, is introduced. The algorithm is closely related to sequence alignment in bioinformatics as similar building blocks are used and applied to this natural language processing task. A benchmark analysis has been carried out to compare MEDITE with other aligners and it shows that our approach is superior to existing ones especially in hard cases.
منابع مشابه
Alignment of noisy unstructured text data
This paper describes a textual aligner named MEDITE whose specificity is the detection of moves. It was developed to solve a problem from textual genetic criticism, a humanities discipline that compares different versions of authors’ texts in order to highlight invariants and differences between them. Our aligner handles this task and it is general enough to handle others. The algorithm, based ...
متن کاملExploitation of an Arabic Language Resource for MT Evaluation: Using Buckwalter-based Lookup Tool to Augment CMU Alignment Algorithm
Voss et al. (2006) analyzed newswire translations of three DARPA GALE Arabic-English MT systems at the segment level in terms of subjective judgment scores, automated metric scores, and correlations among these different score types. At this level of granularity, the correlations are weak. In this paper, we begin to reconcile the subjective and automated scores that underlie these correlations ...
متن کاملExploitation of an Arabic Language Resource for Machine Translation Evaluation: using Buckwalter-based Lookup Tool to Augment CMU Alignment Algorithm
Voss et al. (2006) analyzed newswire translations of three DARPA GALE Arabic-English MT systems at the segment level in terms of subjective judgment scores, automated metric scores, and correlations among these different score types. At this level of granularity, the correlations are weak. In this paper, we begin to reconcile the subjective and automated scores that underlie these correlations ...
متن کاملA Phrase-Based Alignment Model for Natural Language Inference
The alignment problem—establishing links between corresponding phrases in two related sentences—is as important in natural language inference (NLI) as it is in machine translation (MT). But the tools and techniques of MT alignment do not readily transfer to NLI, where one cannot assume semantic equivalence, and for which large volumes of bitext are lacking. We present a new NLI aligner, the MAN...
متن کاملPARMA: A Predicate Argument Aligner
We introduce PARMA, a system for crossdocument, semantic predicate and argument alignment. Our system combines a number of linguistic resources familiar to researchers in areas such as recognizing textual entailment and question answering, integrating them into a simple discriminative model. PARMA achieves state of the art results on an existing and a new dataset. We suggest that previous effor...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2006