Efficient and Scalable Indexing Techniques for Biological Sequence Data

Authors

  • Mihail Halachev
  • Nematollaah Shiri
  • Anand Thamildurai
Abstract

We investigate indexing techniques for sequence data, which are crucial in a wide variety of applications that require efficient, scalable, and versatile search algorithms. Recent research has focused on suffix trees (ST) and suffix arrays (SA) as desirable index representations. Existing solutions for very long sequences, however, provide either efficient index construction or efficient search, but not both. We propose a new ST representation, STTD64, which has reasonable construction time and storage requirements and is efficient in search. We have implemented the construction and search algorithms for the proposed technique and conducted numerous experiments to evaluate its performance on various types of real sequence data. Our results show that while the construction time for STTD64 is comparable with that of current ST-based techniques, it outperforms them in search. Compared to ESA, the best known SA technique, STTD64 is slower to construct but has a similar space requirement and comparable search time. Unlike ESA, which is memory based, STTD64 is scalable and can handle very long sequences.
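
For readers unfamiliar with the index structures named in the abstract, the following is a minimal sketch of a plain suffix array and a binary-search lookup over it. It is not STTD64 or ESA, whose layouts are not described here; the function names `build_suffix_array` and `find_occurrences` are hypothetical, and the naive construction is only meant for short illustrative strings.

```python
def build_suffix_array(text):
    """Return the start positions of all suffixes of text, in sorted suffix order.

    Naive construction for illustration only: it sorts by whole suffixes,
    which is far too slow and memory-hungry for genome-scale sequences.
    """
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Return all start positions of pattern in text, using suffix array sa."""
    m = len(pattern)

    # First binary search: leftmost suffix whose first m characters are >= pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo

    # Second binary search: leftmost suffix whose first m characters are > pattern.
    hi = len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid

    # Every suffix in sa[start:lo] begins with pattern.
    return sorted(sa[start:lo])


if __name__ == "__main__":
    dna = "ACGTACGTTACG"
    index = build_suffix_array(dna)
    print(find_occurrences(dna, index, "ACG"))
```

Each lookup performs a logarithmic number of comparisons in the sequence length, each touching at most as many characters as the pattern has, which is why the cost of building and storing the index dominates for very long sequences.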

Similar resources

A Novel Information-Aware Octree for the Visualization of Large Scale Time-varying Data

Large scale scientific simulations are increasingly generating very large data sets that present substantial challenges to current visualization systems. In this paper, we develop a new scalable and efficient scheme for the visual exploration of 4-D isosurfaces of time varying data by rendering the 3-D isosurfaces obtained through an arbitrary axis-parallel hyperplane cut. The new scheme is bas...

Adapting Decision Tree-Based Method to Index Large DNA-Protein Sequence Datasets

Currently, the size of biological databases has increased significantly with the growing number of users and the rate of queries where some databases are of terabyte size. Hence, there is an increasing need to access databases at the fastest possible rate. Where biologists are concerned, the need is more of a means to fast, scalable and accurate searching in biological databases. This may seem ...

Adapting and Enhancing the Searching Algorithm Based on Decision Tree Indexing for Large Dna-protein Datasets

Currently, the size of biological databases has increased significantly with the growing number of users and the rate of queries where some databases are of terabyte size. Hence, there is an increasing need to access databases at the fastest possible rate. Where biologists are concerned, the need is more of a means to fast, scalable and accurate searching in biological databases. This may seem ...

A Comparison of Suffix Tree based Indexing and Search Techniques for Querying Protein Structures

Biological research comes across different protein structures inside a cell which may be required to map to known proteins to quickly determine their functionality. Efficient techniques for searching a protein structure in a database containing all the known proteins are needed to classify the protein and predict its function. Comparing the structure of unknown protein individually with every p...

Indexing, Query and Velocity-Constrained

Moving object environments are characterized by large numbers of moving objects and numerous concurrent continuous queries over these objects. Efficient evaluation of these queries in response to the movement of the objects is critical for supporting acceptable response times. In such environments the traditional approach of building an index on the objects (data) suffers from...

Publication date: 2007