Massively Parallel Mapping of Next Generation Sequence Reads Using GPUs
Abstract
The DNA sequence alignment problem can be broadly defined as the character-level comparison of DNA sequences obtained from one or more samples against a database of reference (i.e., consensus) genome sequences of the same or a similar species. High-throughput sequencing (HTS) technologies were introduced in 2006 [6], and the latest iterations can read the genome of a human individual in just three days at a cost of approximately $1,000. However, they also present a computational problem, since analyzing HTS data requires comparing more than 1 billion short “reads” (100 characters, or base pairs, each) against a very long reference genome (3 billion base pairs). Because DNA molecules are composed of two opposing strands (i.e., two complementary strings), the number of required comparisons is doubled. Instead of running a full local alignment of each short read against the long reference, heuristics are applied to speed up the process. First, partial sequence matches, called “seeds”, are quickly found using either the Burrows-Wheeler Transform (BWT) [1] followed by the Ferragina-Manzini (FM) index [2], or a simple hash table [8]. Next, the candidate locations are verified using a dynamic programming alignment algorithm that calculates the Levenshtein edit distance [3] and runs in quadratic time. Although these heuristics are substantially faster than full local alignment, the repetitive nature of the human genome means they often require hundreds of verification runs per read, imposing a heavy computational burden. However, all of these billions of alignments are independent of each other, so the read mapping problem is embarrassingly parallel. Our goal in this project is to develop and implement a GPGPU-friendly algorithm based on Levenshtein’s algorithm [3] that can compute millions of dynamic programming matrices concurrently.
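The seed-and-verify strategy described above can be illustrated with a minimal CPU-side sketch in Python. It assumes the simple hash-table variant of seeding, uses only the first k characters of each read as the seed, and verifies each candidate against a reference window of the same length as the read. The names `build_seed_index`, `map_read`, `k`, and `max_edits` are hypothetical and chosen for illustration; this is not the authors' GPU implementation.

```python
from collections import defaultdict

# Complement table for the four DNA bases.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the sequence of the opposite strand, read in its own direction."""
    return seq.translate(COMPLEMENT)[::-1]


def build_seed_index(reference, k):
    """Hash table mapping every k-mer of the reference to its start positions."""
    index = defaultdict(list)
    for i in range(len(reference) - k + 1):
        index[reference[i:i + k]].append(i)
    return index


def edit_distance(a, b):
    """Levenshtein distance via the classic quadratic-time dynamic program."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # deletion
                            curr[j - 1] + 1,            # insertion
                            prev[j - 1] + (ca != cb)))  # match or substitution
        prev = curr
    return prev[-1]


def map_read(read, reference, index, k, max_edits):
    """Seed with the first k-mer of each strand, then verify every candidate."""
    hits = []
    for strand, seq in (("+", read), ("-", reverse_complement(read))):
        for pos in index.get(seq[:k], []):
            # Simplification: the window has the read's length, so an indel
            # also costs an extra edit at the end of the window.
            window = reference[pos:pos + len(seq)]
            distance = edit_distance(seq, window)
            if distance <= max_edits:
                hits.append((pos, strand, distance))
    return hits
```

Each call to `edit_distance` is independent of every other call, which is the property that makes the verification step a natural fit for one-alignment-per-thread execution on a GPU.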
We implement our algorithms on the CUDA (Compute Unified Device Architecture) platform and test them on NVIDIA Tesla K20 GPGPU processors. In this work, we propose a massively parallel, fast, memory-aware Levenshtein edit distance algorithm for graphics processing units, combined with Ukkonen’s approximation algorithm [7] to avoid redundant calculations in the matrices. Considering the memory limitations and the very high number of available threads, our algorithm ensures maximum occupancy on GPGPUs.
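One common way to avoid redundant matrix cells, in the spirit of the Ukkonen-style cutoff mentioned above, is to fill only a diagonal band of width `max_edits` on each side of the main diagonal, since any alignment with at most `max_edits` edits cannot leave that band. The Python sketch below is an illustration under that assumption: the function name, the `max_edits` parameter, and the early-exit rule are hypothetical, and it is a CPU analogue rather than the CUDA kernel of the paper.

```python
def banded_edit_distance(a, b, max_edits):
    """Return the edit distance of a and b if it is <= max_edits, else None.

    Only cells with |i - j| <= max_edits are computed; all other cells are
    treated as "too expensive" (INF), which prunes most of the matrix.
    """
    n, m = len(a), len(b)
    if abs(n - m) > max_edits:
        return None  # the length difference alone already needs too many edits
    INF = max_edits + 1  # any value above the threshold is clamped to INF
    prev = [j if j <= max_edits else INF for j in range(m + 1)]
    for i in range(1, n + 1):
        lo = max(1, i - max_edits)   # first column inside the band
        hi = min(m, i + max_edits)   # last column inside the band
        curr = [INF] * (m + 1)
        if i <= max_edits:
            curr[0] = i              # column 0 is still inside the band
        for j in range(lo, hi + 1):
            cost = a[i - 1] != b[j - 1]
            curr[j] = min(prev[j - 1] + cost,  # match or substitution
                          prev[j] + 1,         # deletion
                          curr[j - 1] + 1,     # insertion
                          INF)
        if min(curr[lo - 1:hi + 1]) >= INF:
            return None              # every cell in the band exceeds the threshold
        prev = curr
    return prev[m] if prev[m] <= max_edits else None
```

With a fixed `max_edits`, each row touches at most `2 * max_edits + 1` cells, so the work per read grows linearly with read length and the per-alignment memory stays small, which matters when many alignments share limited GPU memory.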
Similar resources
JVM: Java Visual Mapping tool for next generation sequencing read.
We developed a program, JVM (Java Visual Mapping), for mapping next-generation sequencing reads to a reference sequence. The program is implemented in Java and is designed to deal with millions of short reads generated by sequence alignment using the Illumina sequencing technology. It employs a seed index strategy and octal encoding operations for sequence alignments. JVM is useful for DNA-Seq, RNA-Seq...
The GNUMAP algorithm: unbiased probabilistic mapping of oligonucleotides from next-generation sequencing
MOTIVATION The advent of next-generation sequencing technologies has increased the accuracy and quantity of sequence data, opening the door to greater opportunities in genomic research. RESULTS In this article, we present GNUMAP (Genomic Next-generation Universal MAPper), a program capable of overcoming two major obstacles in the mapping of reads from next-generation sequencing runs. First, w...
Mapping Accuracy of Short Reads from Massively Parallel Sequencing and the Implications for Quantitative Expression Profiling
BACKGROUND Massively parallel sequencing offers an enormous potential for expression profiling, in particular for interspecific comparisons. Currently, different platforms for massively parallel sequencing are available, which differ in read length and sequencing costs. The 454-technology offers the highest read length. The other sequencing technologies are more cost effective, on the expense o...
SAMStat: monitoring biases in next generation sequencing data
MOTIVATION The sequence alignment/map format (SAM) is a commonly used format to store the alignments between millions of short reads and a reference genome. Often certain positions within the reads are inherently more likely to contain errors due to the protocols used to prepare the samples. Such biases can have adverse effects on both mapping rate and accuracy. To understand the relationship b...
Comparison of Sequence Reads Obtained from Three Next-Generation Sequencing Platforms
Next-generation sequencing technologies enable the rapid cost-effective production of sequence data. To evaluate the performance of these sequencing technologies, investigation of the quality of sequence reads obtained from these methods is important. In this study, we analyzed the quality of sequence reads and SNP detection performance using three commercially available next-generation sequenc...
Exact and complete short read alignment to microbial genomes using GPU programming
Motivation: The introduction of next generation sequencing techniques and especially the high-throughput systems Solexa (Illumina Inc.) and SOLiD (ABI) made the mapping of short reads to reference sequences a standard application in modern bioinformatics. Short read alignment is needed for reference based re-sequencing of complete genomes as well as for gene expression analysis based on transcr...
Publication date: 2015