Gene mention normalization in full texts using GNAT and LINNAEUS

Authors

  • Illés Solt
  • Martin Gerner
  • Philippe Thomas
  • Goran Nenadic
  • Casey M. Bergman
  • Ulf Leser
  • Jörg Hakenberg
Abstract

Gene mention normalization (GN) refers to the automated mapping of gene names to a unique identifier, such as an NCBI Entrez Gene ID. Such knowledge helps in indexing and retrieval, linkage to additional information (such as sequences), database curation, and data integration. We present here an ensemble system that combines LINNAEUS, for recognizing organism names, with GNAT, for recognizing and normalizing gene mentions while taking into account the species information provided by LINNAEUS. Candidate identifiers are filtered through a series of steps that take the local context of a given mention into account. On the BioCreative III high-quality training data, our system achieves TAP-5 and TAP-20 scores of 0.36 and 0.41, respectively. On the evaluation set of 50 documents that were provided to participants, we achieve scores of 0.16 and 0.20 for TAP-5 and TAP-20, respectively. Our analysis of the evaluation results suggests that the lower scores are due primarily to significant differences in species composition, and partly to the method used to select the evaluation data.
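
As a rough illustration of the species-aware candidate filtering described in the abstract, the following Python sketch keeps only those candidate identifiers whose species is mentioned in the local context. It is not the GNAT or LINNAEUS API: the `Candidate` type, the `filter_by_species` function, and the placeholder gene identifiers are hypothetical, and the fallback of keeping all candidates when no species matches is an assumption made for this example. The taxonomy IDs 9606 (human) and 10090 (mouse) are real NCBI Taxonomy identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    gene_id: str     # hypothetical placeholder, standing in for an Entrez Gene ID
    species_id: str  # NCBI Taxonomy ID of the species the gene belongs to


def filter_by_species(candidates, species_in_context):
    """Keep candidates whose species appears in the local context.

    Assumption for this sketch: if no candidate matches, all candidates
    are returned unchanged so that later filtering steps can still decide.
    """
    kept = [c for c in candidates if c.species_id in species_in_context]
    return kept if kept else list(candidates)


# Toy data: one gene name that is ambiguous between human and mouse.
candidates = [
    Candidate("GENE_A_HUMAN", "9606"),
    Candidate("GENE_A_MOUSE", "10090"),
]

# Species IDs that a species tagger might report near the gene mention.
species_in_context = {"10090"}

result = filter_by_species(candidates, species_in_context)
```

Here `result` contains only the mouse candidate, since mouse is the only species reported in the context. A real system would apply further context-based filters after this step, as the abstract indicates.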

Similar resources

Inter-species normalization of gene mentions with GNAT

MOTIVATION Text mining in the biomedical domain aims at helping researchers to access information contained in scientific publications in a faster, easier and more complete way. One step towards this aim is the recognition of named entities and their subsequent normalization to database identifiers. Normalization helps to link objects of potential interest, such as genes, to detailed informatio...

The GNAT library for local and remote gene mention normalization

SUMMARY Identifying mentions of named entities, such as genes or diseases, and normalizing them to database identifiers have become an important step in many text and data mining pipelines. Despite this need, very few entity normalization systems are publicly available as source code or web services for biomedical text mining. Here we present the Gnat Java library for text retrieval, named enti...

Collective Instance-Level Gene Normalization on the IGN Corpus

A high proportion of life science research is gene-oriented, in which scientists aim to investigate the roles that genes play in biological processes, and their involvement in biological mechanisms. As a result, gene names and their related information turn out to be one of the main objects of interest in biomedical literature. While the capability of recognizing gene mentions has made sign...

Cross-species Gene Normalization at the University of Iowa

Background: With the increasing availability of full text articles through open access publishing, the scope of biomedical text mining is no longer limited to the abstracts of research literature. Cross-species gene normalization using full-text articles is an important step towards the use of full text articles in the area of biomedical text-mining research. This was one of the goals of the Bi...

Towards Gene Recognition from Rare and Ambiguous Abbreviations using a Filtering Approach

Retrieving information about highly ambiguous gene/protein homonyms is a challenge, in particular where their non-protein meanings are more frequent than their protein meaning (e.g., SAH or HF). Due to their limited coverage in common benchmarking data sets, the performance of existing gene/protein recognition tools on these problematic cases is hard to assess. We uniformly sample a corpus of ...


Publication date: 2010