Aligning Sequences by Minimum Description Length
Abstract
This paper presents a new information theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from CLUSTALW. A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.
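The code-length idea in the abstract can be illustrated with a minimal sketch. It is not the paper's encoding: the four-letter alphabet, the uniform background frequencies, the 0.7/0.1 conditional probabilities, and all function names are assumptions chosen for illustration, and the cost of encoding the regular expression itself is ignored. The sketch only shows why sending the second substring conditioned on the first takes fewer bits when the letters are similar, and more bits when they are not.

```python
import math

# Hypothetical uniform background frequencies over a toy alphabet (assumption).
BACKGROUND = {"A": 0.25, "C": 0.25, "G": 0.25, "T": 0.25}


def conditional(b, a):
    """Hypothetical P(b | a): a letter is kept with probability 0.7,
    and replaced by each of the other three letters with probability 0.1."""
    return 0.7 if a == b else 0.1


def independent_bits(s):
    """Bits needed to send s letter by letter using background frequencies."""
    return sum(-math.log2(BACKGROUND[ch]) for ch in s)


def conditional_bits(s, t):
    """Bits needed to send t when the receiver already knows the aligned s."""
    if len(s) != len(t):
        raise ValueError("this sketch only handles gap-free, equal-length substrings")
    return sum(-math.log2(conditional(b, a)) for a, b in zip(s, t))


def pair_bits_unaligned(s, t):
    """Send both substrings independently."""
    return independent_bits(s) + independent_bits(t)


def pair_bits_aligned(s, t):
    """Send s independently, then t conditioned on s."""
    return independent_bits(s) + conditional_bits(s, t)


if __name__ == "__main__":
    for s, t in [("ACGT", "ACGA"), ("AAAA", "CCCC")]:
        print(s, t, pair_bits_unaligned(s, t), round(pair_bits_aligned(s, t), 3))
```

Under these assumed probabilities, the similar pair is cheaper to send aligned, while the dissimilar pair is cheaper to send unaligned, which is the comparison a minimum description length search would use to decide whether to treat two substrings as a region of similarity.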
Similar resources
Bayesian Top-Down Protein Sequence Alignment with Inferred Position-Specific Gap Penalties
We describe a Bayesian Markov chain Monte Carlo (MCMC) sampler for protein multiple sequence alignment (MSA) that, as implemented in the program GISMO and applied to large numbers of diverse sequences, is more accurate than the popular MSA programs MUSCLE, MAFFT, Clustal-Ω and Kalign. Features of GISMO central to its performance are: (i) It employs a "top-down" strategy with a favorable asympto...
Less is more: towards an optimal universal description of protein folds
MOTIVATION Identification and characterization of protein structure regularities can reveal the mechanisms governing protein structure, function and evolution. Here we focus on an intermediate level of regularity. We have developed automated methods to systematically construct a dictionary of supersecondary structures that can be used as 'protein parts' to describe fold-sized structures. RESU...
Learning A Highly Structured Motion Model for 3D Human Tracking
This paper presents our work on learning high level structure from human motion sequences, and its applications in human figure tracking. We propose a two-step unsupervised learning approach to recover the “primitives” from 3D-motion captured sequences of complex human motion. The structure recovery is done under the MDL (minimum description length) paradigm. Then the learnt dynamic model of hu...
Learning A Highly Structured Motion Model for 3D Human Tracking
This paper presents our work on learning high level structure from human motion sequences, and its applications in human figure tracking. We use a structured representation (“primitives” and their transitions) of complex motion and propose a two-step unsupervised learning approach to recover the natural “primitives” from unsegmented 3D-motion captured sequences of complex human motion. The stru...
Inference by Conversion
We are discussing a modeling technique based on the idea to generate data sequences with a number of suggested models. These sequences are transformed, or converted, into an observed data sequence by a suitable function, or a program. The motivation for doing so is in cases where the likelihood of observed data is hard to compute, which is circumvented with an indirect approximation by trying t...
Journal title:
Volume 2007, issue
Pages -
Publication date: 2007