Finding Motifs with Insufficient Number of Strong Binding Sites

Authors

  • Henry C. M. Leung
  • Francis Y. L. Chin
  • Siu-Ming Yiu
  • Ronald Rosenfeld
  • Wai Wan Tsang

Abstract

A transcription factor is a molecule that usually binds to the promoter sequences of a set of coexpressed genes. As a result, these promoter sequences contain short substrings, or binding sites, with similar patterns. The motif discovery problem is to find these similar patterns, or motifs, in a set of sequences. Most existing algorithms find motifs based on strong-signal sequences only (i.e., those containing binding sites very similar to the motif). In this paper, we represent a motif as a probability matrix and calculate the minimum total number of binding sites that the input dataset must contain in order to confirm that the discovered motifs are not artifacts. Next, we introduce a more general and realistic energy-based model, which considers all sequences together with their varying degrees of binding strength to the transcription factor (as measured experimentally). Using these varying binding strengths, we develop a heuristic algorithm called EBMF (Energy-Based Motif Finding Algorithm) to find the motif; it can handle sequences ranging from those that contain more than one binding site to those that contain none. EBMF can find motifs in datasets that do not even have the minimum number of binding sites derived above. EBMF compares favorably with the common motif-finding programs AlignACE and MEME. In particular, for some simulated and real datasets, EBMF finds the motif when both AlignACE and MEME fail to do so.
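To make the probability-matrix representation concrete, the following is a minimal Python sketch of how a candidate binding site can be scored against such a matrix. It is a generic illustration, not the EBMF algorithm or the paper's energy-based model. The names `site_probability`, `log_odds`, `best_site`, and `TOY_MOTIF`, the uniform background of 0.25 per base, and the matrix values are all assumptions made for the example.

```python
import math

def site_probability(matrix, site):
    """Probability that the motif matrix generates `site`, treating positions as independent."""
    p = 1.0
    for column, base in zip(matrix, site):
        p *= column[base]
    return p

def log_odds(matrix, site, background=0.25):
    """Log-likelihood ratio of `site` under the motif versus a uniform background (assumed)."""
    return math.log(site_probability(matrix, site) / background ** len(site))

def best_site(matrix, sequence):
    """Return (score, offset, substring) for the highest-scoring window in `sequence`."""
    width = len(matrix)
    if len(sequence) < width:
        raise ValueError("sequence is shorter than the motif")
    windows = [
        (log_odds(matrix, sequence[i:i + width]), i, sequence[i:i + width])
        for i in range(len(sequence) - width + 1)
    ]
    return max(windows)

# Hypothetical 4-column motif favouring "TATA"; every column sums to 1
# and has no zero entries, so the logarithm is always defined.
TOY_MOTIF = [
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
    {"A": 0.1, "C": 0.1, "G": 0.1, "T": 0.7},
    {"A": 0.7, "C": 0.1, "G": 0.1, "T": 0.1},
]

score, offset, site = best_site(TOY_MOTIF, "GGCTATAGG")
```

In this toy setting a strong binding site is a window with a high log-odds score. A sequence whose best window scores near or below zero corresponds to the weak or absent binding sites that the abstract says strong-signal methods ignore.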


Similar resources

In silico investigation of lactoferrin protein characterizations for the prediction of anti-microbial properties

Lactoferrin (Lf) is an iron-binding multi-functional glycoprotein which has numerous physiological functions such as iron transportation, anti-microbial activity and immune response. In this study, different in silico approaches were exploited to investigate Lf protein properties in a number of mammalian species. Results showed that the iron-binding site, DNA and RNA-binding sites, signal pepti...

Finding motifs from all sequences with and without binding sites

MOTIVATION: Finding common patterns, motifs, from a set of promoter regions of coregulated genes is an important problem in molecular biology. Most existing motif-finding algorithms consider a set of sequences bound by the transcription factor as the only input. However, we can get better results by considering sequences that are not bound by the transcription factor as an additional input. RE...

Combining frequency and positional information to predict transcription factor binding sites

MOTIVATION: Even though a number of genome projects have been finished on the sequence level, still only a small proportion of DNA regulatory elements have been identified. Growing amounts of gene expression data provide the possibility of finding coregulated genes by clustering methods. By analysis of the promoter regions of those genes, rather weak signals of transcription factor binding sites...

Finding Transcription Factor Binding Motifs for Coregulated Genes by Combining Sequence Overrepresentation with Cross-Species Conservation

Novel computational methods for finding transcription factor binding motifs have long been sought because identifying them experimentally is tedious work. However, the currently prevailing methods yield a large number of false positive predictions due to the short, variable nature of transcription factor binding sites (TFBSs). We proposed here a method that combines sequence overrepresentation an...

Finding DNA Motifs: A Probabilistic Suffix Tree Approach

We address the problem of de novo motif identification. That is, given a set of DNA sequences, we try to identify motifs in the dataset without any prior knowledge of whether the dataset contains any motifs. We propose a method based on Probabilistic Suffix Trees (PSTs) to identify fixed-length motifs from a given set of DNA sequences. Our experiments reveal that our approach successful...

Journal title:
  • Journal of computational biology : a journal of computational molecular cell biology

Volume 12, Issue 6

Pages: -

Publication date: 2005