Pattern discovery methods for protein topology diagrams

Authors

  • David R. Gilbert
  • Juris Viksna
Abstract

We are carrying out research into developing several approaches to pattern discovery in protein topology diagrams, and comparing them. The underlying motivation is to efficiently and automatically generate patterns classifying sets of proteins and to apply this to characterising databases of protein structure. We are using TOPS protein topology diagrams, which we have formalised as a restricted kind of graph in [3]. In general, pattern discovery in TOPS diagrams can be reduced to subgraph isomorphism and maximal common subgraph problems in oriented labelled graphs. We are using two different approaches (illustrated in an experimental software suite at […]). Learning (pattern extension) is based on searching for common patterns constructed by pattern extension and on repeated pattern matching; in this sense it is similar to "pattern-driven" approaches for strings [1]. In general this method is very simple: starting with an empty pattern, we try to extend it in all possible ways and discard extensions that do not match a given set of examples, until we find a largest pattern that cannot be extended further. Such an approach seems in general to be far too inefficient for graphs of the size typical of TOPS diagrams; however, a somewhat limited version of this method (one that does not always guarantee finding the best possible pattern) works surprisingly fast for most of the real examples from the TOPS database. A good property of this approach is that the algorithm complexity grows in proportion to the number of examples in the learning set. We are now working on improved versions of the pattern matching and pattern learning algorithms and expect to obtain a practically efficient pattern discovery method that guarantees finding the maximal pattern(s). We have been experimenting with learning patterns for […] and […] domains at the H level in the CATH hierarchy.
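The pattern-extension idea described above can be illustrated with a small Python sketch. It is not the authors' implementation: the encoding of a diagram as a tuple of vertex labels plus a set of labelled edges, the brute-force matcher, and the names `matches`, `extensions` and `learn_pattern` are all assumptions made for illustration. The sketch uses a greedy strategy (accept the first extension that still matches every example), so, like the limited version mentioned in the abstract, it is not guaranteed to find the best possible pattern.

```python
from itertools import combinations

# Hypothetical encoding: a diagram is (labels, edges). `labels` is a tuple of
# vertex labels in sequence order; `edges` is a frozenset of (i, j, edge_label)
# triples with i < j.

def matches(pattern, target):
    """True if the pattern embeds in the target through an order-preserving
    map that keeps vertex labels and pattern edges (brute force)."""
    p_labels, p_edges = pattern
    t_labels, t_edges = target
    for pos in combinations(range(len(t_labels)), len(p_labels)):
        labels_ok = all(t_labels[pos[k]] == p_labels[k]
                        for k in range(len(p_labels)))
        if labels_ok and all((pos[i], pos[j], lab) in t_edges
                             for i, j, lab in p_edges):
            return True
    return False

def extensions(pattern, vertex_labels, edge_labels):
    """Yield every pattern obtained by adding one edge or one vertex."""
    labels, edges = pattern
    n = len(labels)
    # Add one labelled edge between two existing vertices.
    for i, j in combinations(range(n), 2):
        for lab in edge_labels:
            if (i, j, lab) not in edges:
                yield labels, edges | {(i, j, lab)}
    # Insert one vertex at any position, shifting later edge endpoints.
    for at in range(n + 1):
        shifted = frozenset((i + (i >= at), j + (j >= at), lab)
                            for i, j, lab in edges)
        for lab in vertex_labels:
            yield labels[:at] + (lab,) + labels[at:], shifted

def learn_pattern(examples, vertex_labels, edge_labels):
    """Greedily extend the empty pattern while it matches all examples."""
    if not examples:
        raise ValueError("at least one example is required")
    pattern = ((), frozenset())
    extended = True
    while extended:
        extended = False
        for cand in extensions(pattern, vertex_labels, edge_labels):
            if all(matches(cand, ex) for ex in examples):
                pattern, extended = cand, True
                break
    return pattern
```

Each accepted extension is checked against every example once, which is consistent with the observation that the cost grows with the size of the learning set. The loop terminates because every extension adds a vertex or an edge and no pattern can be larger than the smallest example.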
We are evaluating our results by determining how many false positives are returned by a search over the database using the discovered pattern; initial results are encouraging in many cases. Clique detection works by finding common patterns via maximal clique detection in edge product graphs, using for example the Bron-Kerbosch algorithm [2]. We have implemented a modified version of this approach that takes advantage of the ordering of graphs and reduces the number of edges in the edge product graph by a factor of 6. Unfortunately, this modification also does not allow us to use the Koch and Lengauer variant (algorithm …

Similar resources

Pharmaceutical Advances and Proteomics Researches

Proteomics enables understanding the composition, structure, function and interactions of the entire protein complement of a cell, a tissue, or an organism under exactly defined conditions. Some factors such as stress or drug effects will change the protein pattern and cause the presence or absence of a protein or gradual variation in abundances. Changes in the proteome provide a snapshot of the...

Pattern Matching and Pattern Discovery Algorithms for Protein Topologies

We describe algorithms for pattern matching and pattern learning in TOPS diagrams (formal descriptions of protein topologies). These problems can be reduced to checking for subgraph isomorphism and finding maximal common subgraphs in a restricted class of ordered graphs. We have developed a subgraph isomorphism algorithm for ordered graphs, which performs well on the given set of data. The maxi...

PTGL: a database for secondary structure-based protein topologies

With the growing amount of experimental data, the number of known protein structures also increases continuously. Classification of protein structures helps to understand relationships between protein structure and function. The main classification methods based on secondary structures are SCOP, CATH and TOPS, which all classify under different aspects, and therefore can lead to different results. ...

Motif-based searching in TOPS protein topology databases

MOTIVATION: TOPS cartoons are a schematic abstraction of protein three-dimensional structures in two dimensions, and are used for understanding and manual comparison of protein folds. Recently, an algorithm that produces the cartoons automatically from protein structures has been devised and cartoons have been generated to represent all the structures in the structural databank. There is now a need to b...

Publication date: 1999