Turtle: Identifying frequent k-mers with cache-efficient algorithms

نویسندگان

Rajat Shuvro Roy

Debashish Bhattacharya

Alexander Schliep

چکیده

MOTIVATION Counting the frequencies of k-mers in read libraries is often a first step in the analysis of high-throughput sequencing data. Infrequent k-mers are assumed to be a result of sequencing errors. The frequent k-mers constitute a reduced but error-free representation of the experiment, which can inform read error correction or serve as the input to de novo assembly methods. Ideally, the memory requirement for counting should be linear in the number of frequent k-mers and not in the, typically much larger, total number of k-mers in the read library. RESULTS We present a novel method that balances time, space and accuracy requirements to efficiently extract frequent k-mers even for high-coverage libraries and large genomes such as human. Our method is designed to minimize cache misses in a cache-efficient manner by using a pattern-blocked Bloom filter to remove infrequent k-mers from consideration in combination with a novel sort-and-compact scheme, instead of a hash, for the actual counting. Although this increases theoretical complexity, the savings in cache misses reduce the empirical running times. A variant of method can resort to a counting Bloom filter for even larger savings in memory at the expense of false-negative rates in addition to the false-positive rates common to all Bloom filter-based approaches. A comparison with the state-of-the-art shows reduced memory requirements and running times. AVAILABILITY AND IMPLEMENTATION The tools are freely available for download at http://bioinformatics.rutgers.edu/Software/Turtle and http://figshare.com/articles/Turtle/791582.

برای دانلود رایگان متن کامل این مقاله و بیش از 32 میلیون مقاله دیگر ابتدا ثبت نام کنید

ثبت نام

اگر عضو سایت هستید لطفا وارد حساب کاربری خود شوید

منابع مشابه

These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure

K-mer abundance analysis is widely used for many purposes in nucleotide sequence analysis, including data preprocessing for de novo assembly, repeat detection, and sequencing coverage estimation. We present the khmer software package for fast and memory efficient online counting of k-mers in sequencing data sets. Unlike previous methods based on data structures such as hash tables, suffix array...

متن کامل

SparseAssembler2: Sparse k-mer Graph for Memory Efficient Genome Assembly

Motivation: To tackle the problem of huge memory usage associated with de Bruijn graph-based algorithms, upon which some of the most widely used de novo genome assemblers have been built, we released SparseAssembler1. SparseAssembler1 can save as much as 90% memory consumption in comparison with the state-of-art assemblers, but it requires rounds of denoising to accurately assemble genomes. Alg...

متن کامل

Mining K-mers of Various Lengths in Biological Sequences

Counting the occurrence frequency of each k-mer in a biological sequence is an important step in many bioinformatics applications. However, most k-mer counting algorithms rely on a given k to produce single-length k-mers, which is inefficient for sequence analysis for different k. Moreover, existing k-mer counters focus more on DNA sequences and less on protein ones. In practice, the analysis o...

متن کامل

Design, Implementation and Performance Analysis of Highly Efficient Algorithms for AES Key Retrieval in Access-driven Cache-based Side Channel Attacks

Leakage of information between two processes sharing the same processor cache has been exploited in many novel approaches targeting various cryptographic algorithms. The software implementation of AES is an especially attractive target since it makes extensive use of cache-resident table lookups. We consider two attack scenarios where either the plaintext or ciphertext is known. We employ a mul...

متن کامل

Efficient Approximation Algorithms for String Kernel Based Sequence Classification

Sequence classification algorithms, such as SVM, require a definition of distance (similarity) measure between two sequences. A commonly used notion of similarity is the number of matches between k-mers (k-length subsequences) in the two sequences. Extending this definition, by considering two k-mers to match if their distance is at most m, yields better classification performance. This, howeve...

متن کامل

ذخیره در منابع من

ذخیره در منابع من قبلا به منابع من ذحیره شده

{@ msg_add @}

با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید

عنوان ژورنال:

Bioinformatics

دوره 30 14 شماره

صفحات -

تاریخ انتشار 2014

Turtle: Identifying frequent k-mers with cache-efficient algorithms

نویسندگان

چکیده

منابع مشابه

These Are Not the K-mers You Are Looking For: Efficient Online K-mer Counting Using a Probabilistic Data Structure

SparseAssembler2: Sparse k-mer Graph for Memory Efficient Genome Assembly

Mining K-mers of Various Lengths in Biological Sequences

Design, Implementation and Performance Analysis of Highly Efficient Algorithms for AES Key Retrieval in Access-driven Cache-based Side Channel Attacks

Efficient Approximation Algorithms for String Kernel Based Sequence Classification

عنوان ژورنال:

اشتراک گذاری