Import of ENZYME Data into the ConceptWiki and its Representation as RDF

Authors

  • Paul Boekschoten
  • Kees Burger
  • Barend Mons
  • Christine Chichester
Abstract

Solutions to the classic problems of handling heterogeneous data and making entire collections interoperable, while ensuring that annotation is tied into the recognition-and-reward system of scientific publishing, need to fit into a seamless end-to-end workflow to attract large numbers of end users. The latest trend in Web applications encourages highly interactive Web sites with rich user interfaces featuring content integrated from various sources around the Web. The potential of RDF, SPARQL, and OWL to provide flexible data modeling, easier data integration, and networked data access may be the answer to these classic problems. Using Semantic Web technologies, we have created a Web application, the ConceptWiki, as an end-to-end solution for creating browser-based read-write triples using RDF, which focuses on data integration and ease of use for the end user. Here we demonstrate the integration of a biological data source, the ENZYME database, into the ConceptWiki and its representation in RDF. The ConceptWiki (www.conceptwiki.org, Figure 1) is an open access repository of editable concepts. It is a web-based wiki system that accepts essentially unlimited numbers of synonyms, in multiple languages, and maps all the terms back to one unique concept identifier, alleviating problems of vocabulary and identifier differences. Each concept in the ConceptWiki is annotated with one or more semantic types and basic information such as a definition. Users can view and edit information through a uniform interface. The information in the system is stored and edited in a highly structured way, as triples (e.g. <concept A> <has synonym> <term B>). The ConceptWiki backend has been designed to support the storage of concepts in a very generic form, thereby avoiding as much as possible the exclusion of potentially valuable information sources.
This compatibility with our other information storage systems enables higher-level applications to easily query, summarize and mine the knowledge. In line with recommendations from the Concept Web Alliance, identifiers in the ConceptWiki are completely opaque: they have no inherent structure and no information can be derived from them. An opaque identifier is a robust identifier, as there will never be a need to change it when the underlying information changes. The ConceptWiki uses Universally Unique Identifiers (UUIDs) as identifiers. The ConceptWiki contains the biomedical terminology of the Unified Medical Language System thesauri and concepts from SwissProt and Medline; in the near future, the repository will be expanded to incorporate the chemical terminology from ChemSpider for biologically relevant chemical molecules. In this poster, we demonstrate the import process for another biologically relevant database, ENZYME (Figure 2). For this import, the ENZYME flat file is first converted into XML. The import script incorporates an XML parser and queries the ConceptWiki database to recover all concepts that match those in the XML. For the ENZYME data, these are concepts containing an EC number as a synonym. The stored ConceptWiki information is then compared to the ENZYME data. If the ENZYME data is unknown to the ConceptWiki, it is inserted into the database. If the ENZYME data is found but differs from what the ConceptWiki holds, the stored information is updated. These changes are shown in the interface: the ENZYME authority checkbox is removed if data are no longer supported by ENZYME, and it is added if the data are new. For example, "EC 1.1.1.1 has synonym Aldehyde reductase" is ENZYME data expressed in the subject-predicate-object triple structure, which, in XML, is the triple structure used by the ConceptWiki backend. Each element is stored with a UUID.
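The compare-then-insert-or-update step described above can be sketched in Python. This is a minimal illustration under stated assumptions, not the actual import script: the XML layout, the in-memory `store` dictionary standing in for the ConceptWiki database, and the function names `parse_enzyme_xml` and `reconcile` are all hypothetical. Each stored synonym carries a boolean that plays the role of the ENZYME authority checkbox.

```python
import uuid
import xml.etree.ElementTree as ET

# Hypothetical XML layout; the schema used by the real import script is not given.
ENZYME_XML = """
<enzymes>
  <enzyme ec="1.1.1.1">
    <synonym>Aldehyde reductase</synonym>
  </enzyme>
</enzymes>
"""


def parse_enzyme_xml(text):
    """Return {ec_number: set of names} from the hypothetical XML layout."""
    root = ET.fromstring(text)
    return {
        enzyme.get("ec"): {s.text for s in enzyme.findall("synonym")}
        for enzyme in root.findall("enzyme")
    }


def reconcile(store, enzyme_data):
    """Insert unknown enzymes and update changed ones.

    store maps an opaque UUID string to
    {"ec": str, "synonyms": {term: has_enzyme_authority}}.
    Returns a list of (action, ec_number) tuples.
    """
    # Concepts are matched on the EC number they carry.
    by_ec = {concept["ec"]: cid for cid, concept in store.items()}
    actions = []
    for ec, names in sorted(enzyme_data.items()):
        cid = by_ec.get(ec)
        if cid is None:
            # Unknown to the store: insert under a fresh opaque identifier.
            cid = str(uuid.uuid4())
            store[cid] = {"ec": ec, "synonyms": {n: True for n in names}}
            actions.append(("insert", ec))
            continue
        synonyms = store[cid]["synonyms"]
        changed = False
        for term in list(synonyms):
            supported = term in names
            if synonyms[term] != supported:
                # Authority flag is cleared (or restored) to follow ENZYME.
                synonyms[term] = supported
                changed = True
        for name in names:
            if name not in synonyms:
                synonyms[name] = True  # new data gains ENZYME authority
                changed = True
        if changed:
            actions.append(("update", ec))
    return actions
```

Calling `reconcile({}, parse_enzyme_xml(ENZYME_XML))` on an empty store reports an insert for EC 1.1.1.1; running it again on the same store reports nothing, since the stored data already agree with ENZYME.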
The RDF data model (Figure 3) is similar to classic conceptual modeling approaches, as it is based upon the idea of making statements about resources, in particular Web resources, in the form of subject-predicate-object expressions. The predicates of RDF triples are similar to hyperlinks; however, the advantage of RDF triples over HTML hyperlinks is that the links are explicitly labeled. Of particular interest, the semantics of the relationship between the two entities are computationally accessible through URI resolution: the ConceptWiki predicate is represented in RDF using its concept UUID. The subject and object are likewise represented in RDF using their ConceptWiki UUIDs. Using the concepts present in the ConceptWiki as a set of basic building blocks, new triples can be assembled by end users via the interface. Through simple drop-down menus, users can establish links between two records to illustrate pertinent connections. All newly created triples are attributed to the participating scientist by showing their name and listing the triple on the page that represents them. For example, the triple "Aldehyde reductase has function sorbitol biosynthetic process" can be built using data included from the ENZYME import (Figure 4). Certain agencies have indicated that they are interested in using the resulting contributions as indices of scholarly achievement.
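As a rough illustration of a triple whose subject, predicate and object are all identified by opaque UUIDs, the following Python sketch serializes the example triple as one N-Triples line. The `ConceptRegistry` class and `to_ntriple` function are hypothetical, and the use of `urn:uuid:` URIs is an assumption made here for the sake of a self-contained example; the abstract does not state which URI scheme the ConceptWiki uses.

```python
import uuid


class ConceptRegistry:
    """Toy stand-in for the ConceptWiki store: maps labels to opaque UUIDs."""

    def __init__(self):
        self._ids = {}

    def concept_id(self, label):
        # Labels are compared case-insensitively; the UUID itself is random
        # and carries no information about the concept.
        key = label.casefold()
        if key not in self._ids:
            self._ids[key] = uuid.uuid4()
        return self._ids[key]


def to_ntriple(registry, subject, predicate, obj):
    """Serialize one triple as an N-Triples line using urn:uuid: URIs (assumed scheme)."""
    urns = [registry.concept_id(label).urn for label in (subject, predicate, obj)]
    return " ".join(f"<{urn}>" for urn in urns) + " ."


registry = ConceptRegistry()
print(to_ntriple(registry, "Aldehyde reductase", "has function",
                 "sorbitol biosynthetic process"))
```

Because the identifiers are generated randomly, the printed URIs differ between runs; within one registry, the same label always resolves to the same identifier.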




Journal:
  • CoRR

Volume abs/1012.1652

Publication date 2010