Morphological Analysis by Multiple Sequence Alignment
Abstract
In biological sequence processing, Multiple Sequence Alignment (MSA) techniques capture information about long-distance dependencies and the three-dimensional structure of protein and nucleotide sequences without resorting to polynomial-complexity context-free models. But MSA techniques have rarely been used in natural language (NL) processing, and never for NL morphology induction. Our MetaMorph algorithm is a first attempt at leveraging MSA techniques to induce NL morphology in an unsupervised fashion. Given a text corpus in any language, MetaMorph sequentially aligns words of the corpus to form an MSA and then segments the MSA to produce morphological analyses. Over corpora that contain millions of unique word types, MetaMorph identifies morphemes at an F1 below state-of-the-art performance. But when restricted to smaller sets of orthographically related words, MetaMorph outperforms the state-of-the-art ParaMor-Morfessor Union morphology induction system. Tested on 5,000 orthographically similar Hungarian word types, MetaMorph reaches 54.1% and ParaMor-Morfessor just 41.9%. Hence, we conclude that MSA is a promising approach to unsupervised morphology induction. Future research directions are discussed.
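The two steps the abstract describes, sequentially aligning words into an MSA and then segmenting it, can be illustrated with a minimal Python sketch. This is not MetaMorph's implementation: the match/mismatch/gap scores, the function names (`column_score`, `add_word`, `build_msa`, `segment`), and the rule that places a boundary wherever columns switch between fully conserved and variable are all assumptions chosen for illustration. The abstract does not specify the scoring scheme or the segmentation criterion.

```python
GAP = "-"

def column_score(column, ch):
    """Assumed scoring: +1 per matching row, -1 per mismatching row,
    0 per gap, averaged over the rows of the column."""
    total = sum(1.0 if c == ch else (0.0 if c == GAP else -1.0) for c in column)
    return total / len(column)

def add_word(msa, word, gap=-0.5):
    """Align one word against the columns of an existing MSA
    (Needleman-Wunsch style dynamic programming) and return the enlarged MSA."""
    if not msa:
        return [word]
    cols = list(zip(*msa))
    n, m = len(cols), len(word)
    dp = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            dp[i][j] = max(
                dp[i - 1][j - 1] + column_score(cols[i - 1], word[j - 1]),
                dp[i - 1][j] + gap,   # existing column aligned to a gap
                dp[i][j - 1] + gap,   # new column opened for this character
            )
    # Trace back from the bottom-right corner, building rows in reverse.
    rows, new = [[] for _ in msa], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and dp[i][j] == dp[i - 1][j - 1] + column_score(cols[i - 1], word[j - 1]):
            for r, c in zip(rows, cols[i - 1]):
                r.append(c)
            new.append(word[j - 1])
            i, j = i - 1, j - 1
        elif j == 0 or (i > 0 and dp[i][j] == dp[i - 1][j] + gap):
            for r, c in zip(rows, cols[i - 1]):
                r.append(c)
            new.append(GAP)
            i -= 1
        else:
            for r in rows:
                r.append(GAP)
            new.append(word[j - 1])
            j -= 1
    return ["".join(reversed(r)) for r in rows] + ["".join(reversed(new))]

def build_msa(words):
    """Sequentially align words, one at a time, into a single MSA."""
    msa = []
    for w in words:
        msa = add_word(msa, w)
    return msa

def segment(msa):
    """Assumed heuristic: cut wherever adjacent columns differ in being
    fully conserved (all rows identical) versus variable."""
    cols = list(zip(*msa))
    conserved = [len(set(c)) == 1 for c in cols]
    cuts = {k for k in range(1, len(cols)) if conserved[k] != conserved[k - 1]}
    analyses = []
    for row in msa:
        parts, cur = [], ""
        for k, ch in enumerate(row):
            if k in cuts and cur:
                parts.append(cur)
                cur = ""
            if ch != GAP:
                cur += ch
        if cur:
            parts.append(cur)
        analyses.append(parts)
    return analyses
```

For example, `segment(build_msa(["walk", "walks", "walked"]))` splits the shared stem from the varying endings. A real system would replace the fixed scores with scores learned from the corpus and would need a less brittle segmentation rule.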
Similar resources
An Application of the ABS LX Algorithm to Multiple Sequence Alignment
We present an application of ABS algorithms for multiple sequence alignment (MSA). The Markov decision process (MDP) based model leads to a linear programming problem (LPP), whose solution is linked to a suggested alignment. The important features of our work include the facility of alignment of multiple sequences simultaneously and no limit for the length of the sequences. Our goal here is to ...
A generalization of Profile Hidden Markov Model (PHMM) using one-by-one dependency between sequences
The Profile Hidden Markov Model (PHMM) can be poor at capturing dependency between observations because of the statistical assumptions it makes. To overcome this limitation, the dependency between residues in a multiple sequence alignment (MSA) which is the representative of a PHMM can be combined with the PHMM. Based on the fact that sequences appearing in the final MSA are written based on th...
Multiple Sequence Alignment for Morphology Induction
MetaMorph is a novel application of multiple sequence alignment (MSA) to natural language morphology induction. Given a text corpus in any language, we sequentially align a subset of the words of the corpus to form an MSA using a probabilistic scoring scheme. We then segment the MSA to produce output analyses. We used this algorithm to compete in the 2009 Morpho Challenge. The F-measure of the ...
Molecular and Bioinformatics Analysis of Allelic Diversity in IGFBP2 Gene Promoter in Indigenous Makuee and Lori-Bakhtiari Sheep Breeds
The aim of this study was to perform molecular and bioinformatics analysis of IGFBP2 gene promoter in association with some economic traits in indigenous Makuee (MS) and Lori-Bakhtiari (LB) breeds. DNA was extracted from blood samples of 120 MS and 200 LB and a 297 bp fragment from the upstream sequences of studied gene was amplified and genotyped by single-strand conformational polymo...
Molecular homology and multiple-sequence alignment: an analysis of concepts and practice
Sequence alignment is just as much a part of phylogenetics as is tree building, although it is often viewed solely as a necessary tool to construct trees. However, alignment for the purpose of phylogenetic inference is primarily about homology, as it is the procedure that expresses homology relationships among the characters, rather than the historical relationships of the taxa. Molecular homol...
Publication date: 2009