High Similarity Sequence Comparison in Clustering Large Sequence Databases
Abstract
We present a fast algorithm for sequence clustering and searching that works with large sequence databases. It uses a strictly defined similarity measure. The algorithm is faster than conventional EST clustering approaches because its complexity depends directly on the number of subwords shared by the sequences. The algorithm also works with protein sequences and with very long sequences such as entire chromosomes. We present a theoretical study of our approach and provide experimental results.
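The abstract ties the running time to the number of subwords shared between sequences. The following is a minimal sketch of that general idea, not the authors' algorithm: it builds an inverted index from fixed-length subwords to the sequences containing them, so that work is only done for pairs that actually share a subword. The function names, the fixed subword length `k`, the `min_shared` threshold, and the single-linkage grouping are all assumptions made for illustration; the paper's own similarity measure is not reproduced here.

```python
from collections import defaultdict
from itertools import combinations


def shared_subword_counts(sequences, k):
    """Count the distinct length-k subwords shared by each pair of sequences.

    Hypothetical helper: pairs with no common subword never appear, so the
    cost is driven by the number of shared subwords, not by all pairs.
    """
    index = defaultdict(set)  # subword -> ids of sequences containing it
    for seq_id, seq in enumerate(sequences):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(seq_id)

    counts = defaultdict(int)
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return dict(counts)


def cluster(sequences, k, min_shared):
    """Group sequences that share at least min_shared subwords (single linkage).

    The threshold is an illustrative stand-in for a real similarity measure.
    """
    parent = list(range(len(sequences)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), n in shared_subword_counts(sequences, k).items():
        if n >= min_shared:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for i in range(len(sequences)):
        groups[find(i)].append(i)
    return sorted(groups.values())


if __name__ == "__main__":
    seqs = ["ACGTACGT", "CGTACGTA", "TTTTGGGG"]
    print(shared_subword_counts(seqs, 4))
    print(cluster(seqs, 4, 3))
```

In this toy input the first two sequences are rotations of each other and share every 4-letter subword, while the third shares none, so only one pair is ever examined. A practical implementation would typically replace the dictionary with a suffix array or similar index to handle chromosome-scale inputs.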
Similar resources
Clustering of highly homologous sequences to reduce the size of large protein databases
We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive datab...
Fingerprinting and genetic diversity evaluation of rice cultivars using Inter Simple Sequence Repeat marker
Rice, as one of the most important agricultural crops, has a putative potential for ensuring food security and addressing poverty in the world. In the present study, in order to provide basic information to improve rice through breeding programs, the Inter Simple Sequence Repeat (ISSR) marker was used for DNA fingerprinting and finding genetic relationships among 32 different cultivars. In this study...
TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets
TGICL is a pipeline for analysis of large Expressed Sequence Tags (EST) and mRNA databases in which the sequences are first clustered based on pairwise sequence similarity, and then assembled by individual clusters (optionally with quality values) to produce longer, more complete consensus sequences. The system can run on multi-CPU architectures including SMP and PVM.
Genetic Variation Among Salvia Species Based on Sequence-Related Amplified Polymorphism (SRAP) Marker
In this study, the SRAP molecular marker approach was used to investigate genetic diversity in the Salvia genus. A total of 205 DNA bands were produced from PCR amplification of 11 Salvia species and populations using 25 selective primer combinations, of which 204 were polymorphic genetic loci. The total number of amplified fragments ranged from 3 to 15. The genetic similarities of 11 coll...
Q-gram Based Database Searching Using a Suffix Array (QUASAR)
With the increasing amount of DNA sequence information deposited in public databases, searching for similarity to a query sequence has become a basic operation in molecular biology. But even today's fast algorithms reach their limits when applied to all-versus-all comparisons of large databases. Here we present a new database searching algorithm called QUASAR (Q-gram Alignment based on Suffix AR...
Journal title: Proceedings. IEEE Computer Society Bioinformatics Conference
Volume: 1
Publication date: 2002