High Similarity Sequence Comparison in Clustering Large Sequence Databases

Authors

Lorie Dudoignon

Eric Glémet

Hendrik Cornelis Heus

Mathieu Raffinot

Abstract

We present a fast algorithm for sequence clustering and searching which works with large sequence databases. It uses a strictly defined similarity measure. The algorithm is faster than conventional EST clustering approaches because its complexity is directly related to the number of subwords shared by the sequences. Furthermore, the algorithm also works with protein sequences and large sequences like entire chromosomes. We present a theoretical study of our approach and provide experimental results.
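The abstract ties the running time to the number of subwords shared by the sequences. The sketch below is not the authors' algorithm; it only illustrates the general idea under stated assumptions: an inverted index of fixed-length subwords means work is done only for pairs that actually share a subword. The function names, the subword length `k`, and the threshold `min_shared` are hypothetical choices made for this illustration.

```python
from collections import defaultdict
from itertools import combinations


def shared_subword_counts(seqs, k):
    """Count distinct length-k subwords shared by each pair of sequences."""
    index = defaultdict(set)  # subword -> ids of sequences containing it
    for sid, seq in enumerate(seqs):
        for i in range(len(seq) - k + 1):
            index[seq[i:i + k]].add(sid)

    counts = defaultdict(int)
    # Only pairs that co-occur under some subword are ever visited.
    for ids in index.values():
        for a, b in combinations(sorted(ids), 2):
            counts[(a, b)] += 1
    return dict(counts)


def cluster(seqs, k, min_shared):
    """Single-linkage clustering: join pairs sharing >= min_shared subwords."""
    parent = list(range(len(seqs)))

    def find(x):
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for (a, b), n in shared_subword_counts(seqs, k).items():
        if n >= min_shared:
            parent[find(a)] = find(b)

    groups = defaultdict(list)
    for sid in range(len(seqs)):
        groups[find(sid)].append(sid)
    return sorted(groups.values())
```

For example, `cluster(["ACGTACGT", "CGTACGTA", "TTTTGGGG"], k=4, min_shared=3)` groups the first two sequences together and leaves the third alone. A real tool would replace the raw shared-subword count with a strictly defined similarity measure, as the abstract describes.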


Similar resources

Clustering of highly homologous sequences to reduce the size of large protein databases

We present a fast and flexible program for clustering large protein databases at different sequence identity levels. It takes less than 2 h for the all-against-all sequence comparison and clustering of the non-redundant protein database of over 560,000 sequences on a high-end PC. The output database, including only the representative sequences, can be used for more efficient and sensitive datab...


Fingerprinting and genetic diversity evaluation of rice cultivars using Inter Simple Sequence Repeat marker

Rice, one of the most important agricultural crops, has putative potential for ensuring food security and addressing poverty in the world. In the present study, to provide basic information for improving rice through breeding programs, the Inter Simple Sequence Repeat (ISSR) marker was used for DNA fingerprinting and for finding genetic relationships among 32 different cultivars. In this study...


TIGR Gene Indices clustering tools (TGICL): a software system for fast clustering of large EST datasets

TGICL is a pipeline for analysis of large Expressed Sequence Tags (EST) and mRNA databases in which the sequences are first clustered based on pairwise sequence similarity, and then assembled by individual clusters (optionally with quality values) to produce longer, more complete consensus sequences. The system can run on multi-CPU architectures including SMP and PVM.


Genetic Variation Among Salvia Species Based on Sequence-Related Amplified Polymorphism (SRAP) Marker

In this study, the SRAP molecular marker approach was used to investigate genetic diversity in the Salvia genus. A total of 205 DNA bands, 204 of them polymorphic, were produced from PCR amplification of 11 Salvia species and populations using 25 selective primer combinations. The total number of amplified fragments ranged from 3 to 15. The genetic similarities of 11 coll...


Q-gram Based Database Searching Using a Suffix Array (QUASAR)

With the increasing amount of DNA sequence information deposited in public databases, searching for similarity to a query sequence has become a basic operation in molecular biology. But even today's fast algorithms reach their limits when applied to all-versus-all comparisons of large databases. Here we present a new database searching algorithm called QUASAR (Q-gram Alignment based on Suffix AR...



Journal title:

Proceedings. IEEE Computer Society Bioinformatics Conference

Volume 1

Publication date: 2002
