Technique review: A tiered approach to comparative genomics

Authors

  • Travis Ptacek
  • Susan M. Sell
Abstract

Comparative genomics has emerged as a valuable tool for locating genes, transcription factor motifs and other putative control regions. There are, however, issues that keep comparative genomics from being a straightforward process. These caveats fall into three categories: database, computational and biological. In this review paper, these caveats will be discussed and illustrated using related case studies. The National Center for Biotechnology Information and University of California Santa Cruz genome databases were used, and VISTA, LAGAN and zPicture were used as comparison tools. Based on these caveats, a tiered approach to carrying out comparative genomic studies is presented.

INTRODUCTION

The role of comparative genomics in molecular biology

With the completion and accessibility of large genome sequence databases, comparative genomics has emerged as a useful tool for developing hypotheses that can later be tested at the bench. The method is based on the premise that regions of biological importance are conserved across species and that comparison of two sequences from different species will yield regions of high identity, including putative control regions. This technique has been especially useful in identifying conserved control regions in non-coding sequences. Comparative genomic studies have been used to identify enhancers in gene-poor regions that function at near megabase distances. With access to online genomic databases, and the availability of comparative genomics tools, comparative genomics has emerged as a useful tool for cataloguing and annotating the non-coding regions of the human genome.

Caveats associated with comparative genomics

Even with access to whole-genomic sequence information, there are caveats. For example, the quality of the sequence data can have a negative impact on a comparative genomics study. Caveats are also associated with the programs used to run the comparisons.
The use of global and local alignment strategies can have dramatic effects on the output of a study. Other factors, such as the size of the search window and the efficiency of the algorithm, can also introduce limitations. Finally, caveats are associated with biological factors such as repeats, inversions, transpositions and pseudogenes, which must be considered when interpreting the results of a comparative genomics study. The presence of pseudogenes may result in regions of high identity that are misinterpreted as genes. Genomic rearrangements present in one species but not in another may force researchers to restructure their study. Here, three case studies are used to show how these limitations can affect a comparative genomics study, and a tiered approach is proposed for dealing with these caveats.

© HENRY STEWART PUBLICATIONS 1473-9550. BRIEFINGS IN FUNCTIONAL GENOMICS AND PROTEOMICS. VOL 4. NO 2. 178–185. JULY 2005

DATABASE CAVEATS

The starting point of any comparative genomics study is the acquisition of data from one of a number of databases. The main concern here is the accuracy and consistency of the sequence data being used for the comparison, given that the full genomic sequence of an organism can be of the order of billions of bases. Furthermore, the databases are constantly being updated.

Another concern involves genetic (as opposed to genomic) databases. Before whole-genome databases were assembled, sequence information was held in sub-genomic databases, which contained sequences of genes (exons, introns and cDNAs) but not a complete assembly of the genome. There are two potential problems with these sub-genomic databases. They may not contain all of the sequence needed by a researcher, or they may contain alleles or sequences other than the wild-type. These sequence differences will affect the results of a comparative genomics study.
Although the concern is less today because of the existence of genomic databases, genetic databases still exist. Sequences for organisms that do not have genome assemblies (such as non-human primates) only exist in genetic databases.

One final, major concern with respect to human databases is that about 1 per cent of the human genome has yet to be sequenced. Most of these sequences are centromeric or in highly repetitive locations, which makes them hard to sequence using current methods. Some of these sequences flank genes. Researchers need to be aware of these issues when obtaining sequence information. Because information in databases may not perfectly represent any genome, researchers should check multiple databases and refer to publications to decide which databases provide the most accurate location and sequence for their locus of interest.

Comparison of KRML and MafB: The importance of accurate coordinate information

A case study in which the human KRML gene and the corresponding mouse MafB gene were compared serves as an example of the consequences of using databases that are constantly being updated. Failure to take this into account may lead to the wrong regions being compared and erroneous conclusions being drawn.

A comparison was made between the sequence of the mouse MafB gene and the sequence of the corresponding human gene (KRML) using zPicture. Sequence information was uploaded to zPicture using coordinate information derived from the National Center for Biotechnology Information (NCBI) database. Sequence information was obtained by searching for MafB using NCBI's map viewer. This comparison yielded no regions of identity (data not shown), despite the fact that MafB and KRML are known to be homologues. Another comparison was run in the same manner using the coordinates of the University of California Santa Cruz (UCSC) database. Coordinate information was obtained by searching for MafB in the UCSC genome browser.
This comparison yielded high identity across the single-exon gene (> 90 per cent identity). These two comparisons were performed in summer 2003.

This case study shows how databases can disagree on the location of a gene. For this reason, it is important to check the databases for their revision history and supporting publications. The case also demonstrates the ever-changing nature of databases. In spring 2004, the NCBI and UCSC databases agreed on the location of KRML, although they had disagreed when these comparisons were run in 2003. This is an example of why it is necessary for researchers involved in comparative genomics studies to check
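The notion of "regions of high identity" used in the case study above can be illustrated with a minimal sketch. It assumes the two sequences have already been aligned (gaps written as "-") and reports percent identity in sliding windows. The function name `window_identity`, the toy sequences and the window settings are hypothetical choices for illustration; this is not how zPicture, VISTA or LAGAN compute their output.

```python
def window_identity(aligned_a, aligned_b, window=100, step=50):
    """Percent identity in sliding windows over two pre-aligned sequences.

    Both inputs must be the same length, with gaps written as '-'.
    Gap columns are never counted as matches.
    Returns a list of (window_start, percent_identity) tuples.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    if not aligned_a:
        raise ValueError("aligned sequences must not be empty")

    results = []
    # If the alignment is shorter than one window, score it as a single window.
    last_start = max(len(aligned_a) - window, 0)
    for start in range(0, last_start + 1, step):
        a = aligned_a[start:start + window].upper()
        b = aligned_b[start:start + window].upper()
        matches = sum(1 for x, y in zip(a, b) if x == y and x != "-")
        results.append((start, 100.0 * matches / len(a)))
    return results


# Toy example (hypothetical sequences, not real KRML/MafB data).
human = "ACGTACGTAC"
mouse = "ACGTTCGTAC"
scores = window_identity(human, mouse, window=5, step=5)
print(scores)  # [(0, 80.0), (5, 100.0)]

# Windows at or above an assumed 90 per cent identity threshold.
conserved = [start for start, pid in scores if pid >= 90.0]
print(conserved)  # [5]
```

Because each score depends on the window size and on which coordinates were extracted in the first place, the same locus can look conserved or unconserved depending on those choices. This is one way the database and computational caveats described above can show up in practice.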

Publication date: 2005