A Corpus for Cross-Document Co-reference

Authors

  • David Day
  • Janet Hitzeman
  • Michael L. Wick
  • Keith Crouch
  • Massimo Poesio
Abstract

This paper describes a newly created text corpus of news articles that has been annotated for cross-document co-reference. Being able to robustly resolve references to entities across document boundaries will provide a useful capability for a variety of tasks, ranging from practical information retrieval applications to challenging research in information extraction and natural language understanding. This annotated corpus is intended to encourage the development of systems that can more accurately address this problem. A manual annotation tool was developed that allowed the complete corpus to be searched for likely co-referring entity mentions. This corpus of 257K words links mentions of co-referent people, locations and organizations (subject to some additional constraints). Each of the documents had already been annotated for within-document co-reference by the LDC as part of the ACE series of evaluations. The annotation process was bootstrapped with a string-matching-based linking procedure, and we report on some initial experimentation with the data. The cross-document linking information will be made publicly available.
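The abstract does not say how the string-matching-based bootstrapping worked. As a rough illustration only, the sketch below assumes the simplest variant: entities from different documents become candidate links when they share an entity type and a normalized name string. The `Entity` structure, the `normalize` rule, and the sample names are all hypothetical and are not taken from the paper or from the ACE data format.

```python
from collections import defaultdict
from typing import NamedTuple


class Entity(NamedTuple):
    """A within-document entity (hypothetical structure, not the ACE format)."""
    doc_id: str
    entity_id: str
    entity_type: str  # e.g. "PER", "LOC", "ORG"
    name: str         # one representative name mention


def normalize(name: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(name.lower().split())


def candidate_links(entities):
    """Group entities that share a type and a normalized name across documents."""
    groups = defaultdict(list)
    for ent in entities:
        groups[(ent.entity_type, normalize(ent.name))].append(ent)
    # Keep only groups that span more than one document.
    return [
        group for group in groups.values()
        if len({ent.doc_id for ent in group}) > 1
    ]


# Made-up sample input for illustration.
entities = [
    Entity("doc1", "e1", "PER", "Tony Blair"),
    Entity("doc2", "e7", "PER", "tony  blair"),
    Entity("doc3", "e2", "ORG", "United Nations"),
]
links = candidate_links(entities)
```

Exact matching of this kind will both miss name variants and merge distinct entities that share a name, which is presumably why such links would serve only as a starting point for manual annotation.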

Related resources

Who is Who and What is What: Experiments in Cross-Document Co-Reference

This paper describes a language-independent, scalable system for both challenges of cross-document co-reference: name variation and entity disambiguation. We provide system results from the ACE 2008 evaluation in both English and Arabic. Our English system’s accuracy is 8.4% relative better than an exact match baseline (and 14.2% relative better over entities mentioned in more than one document)...

C3EL: A Joint Model for Cross-Document Co-Reference Resolution and Entity Linking

Cross-document co-reference resolution (CCR) computes equivalence classes over textual mentions denoting the same entity in a document corpus. Named-entity linking (NEL) disambiguates mentions onto entities present in a knowledge base (KB) or maps them to null if not present in the KB. Traditionally, CCR and NEL have been addressed separately. However, such approaches miss out on the mutual syn...

Entity Profile Extraction from Large Corpora

Information Extraction (IE) has two anchor points: (i) entity-centric information leads to an Entity Profile (EP); (ii) action-centric information leads to an Event Scenario. Based on a pipelined architecture which involves both document-level IE and corpus-level IE, a multi-level modular approach to EP extraction from large corpora is described: (i) named entity tagging; (ii) three-level patte...

Cross-Document Co-Reference Resolution using Sample-Based Clustering with Knowledge Enrichment

Identifying and linking named entities across information sources is the basis of knowledge acquisition and at the heart of Web search, recommendations, and analytics. An important problem in this context is cross-document coreference resolution (CCR): computing equivalence classes of textual mentions denoting the same entity, within and across documents. Prior methods employ ranking, clusterin...

Cross Document Co-Reference Resolution Applications For People In The Legal Domain

By combining information extraction and record linkage techniques, we have created a repository of references to attorneys, judges, and expert witnesses across a broad range of text sources. These text sources include news, caselaw, law reviews, Medline abstracts, and legal briefs among others. We briefly describe our cross document co-reference resolution algorithm and discuss applications the...

Publication date: 2008