Constraint-Based Entity Matching

Authors

  • Warren Shen
  • Xin Li
  • AnHai Doan
Abstract

Entity matching is the problem of deciding if two given mentions in the data, such as “Helen Hunt” and “H. M. Hunt”, refer to the same real-world entity. Numerous solutions have been developed, but they have not considered in depth the problem of exploiting integrity constraints that frequently exist in the domains. Examples of such constraints include “a mention with age two cannot match a mention with salary 200K” and “if two paper citations match, then their authors are likely to match in the same order”. In this paper we describe a probabilistic solution to entity matching that exploits such constraints to improve matching accuracy. At the heart of the solution is a generative model that takes into account the constraints during the generation process, and provides well-defined interpretations of the constraints. We describe a novel combination of EM and relaxation labeling algorithms that efficiently learns the model, thereby matching mentions in an unsupervised way, without the need for annotated training data. Experiments on several real-world domains show that our solution can exploit constraints to significantly improve matching accuracy, by 3-12% F-1, and that the solution scales up to large data sets.
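
The abstract's first example constraint ("a mention with age two cannot match a mention with salary 200K") can be illustrated with a minimal sketch. This is not the paper's generative model or its EM and relaxation labeling procedure. It only shows how a hard incompatibility constraint might veto a candidate pair. The `Mention` class, the function names, and the thresholds `max_child_age` and `min_high_salary` are all hypothetical and chosen for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Mention:
    """Hypothetical mention record; attributes may be missing (None)."""
    name: str
    age: Optional[int] = None
    salary: Optional[int] = None


def violates_age_salary(a: Mention, b: Mention,
                        max_child_age: int = 10,
                        min_high_salary: int = 100_000) -> bool:
    """Assumed hard constraint: a young child cannot match a high earner."""
    def clash(x: Mention, y: Mention) -> bool:
        return (x.age is not None and x.age <= max_child_age
                and y.salary is not None and y.salary >= min_high_salary)
    # The constraint is symmetric, so check both orderings.
    return clash(a, b) or clash(b, a)


def constrained_score(a: Mention, b: Mention, base_score: float) -> float:
    """Zero out a base similarity score when the constraint is violated."""
    return 0.0 if violates_age_salary(a, b) else base_score
```

A pair such as `Mention("H. M. Hunt", age=2)` and `Mention("Helen Hunt", salary=200_000)` would receive a score of 0.0 regardless of name similarity. The abstract's second example ("authors are likely to match") is a soft constraint, which a hard veto like this does not capture.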

Similar resources

A Constraint Language Approach to Grid Resource Selection

The need to discover and select entities that match specified requirements arises in many contexts in distributed systems. Meeting this need is complicated by the fact that not only may the potential consumer specify constraints on resources, but the owner of the entity in question may specify constraints on the consumer. This observation has motivated Raman et al. to propose that discovery and...

A Hierarchical Image Matching Method for Stereo Satellite Imagery

Image matching is an essential and difficult task in digital photogrammetry and computer vision. This paper presents a triangulation-based hierarchical image matching algorithm for stereo satellite imagery. It uses a coarse-to-fine hierarchical strategy and combines feature points and grid points to provide a dense, precise and reliable matching result. First, some seed points are extracted at t...

Adaptive Approximate Record Matching

Typographical data entry errors and incomplete documents produce imperfect records in real-world databases. These errors generate distinct records that belong to the same entity. The aim of Approximate Record Matching is to find multiple records that belong to one entity. In this paper, an algorithm for Approximate Record Matching is proposed that can be adapted automatically with input error...

Multi-level NER for Portuguese in a CG Framework

This paper describes and evaluates a linguistically based NER system for Portuguese, based on lexico-semantic information, pattern matching and morphosyntactic, context-driven Constraint Grammar rules. Preliminary F-scores for cross-domain news texts, when distinguishing six different name types, were 91.85 (raw) and 93.6 (subtyping of ready-chunked proper nouns).

Constraint-Based Reasoning in Geographic Databases: the Case of Symbolic Arrays

Symbolic arrays are hierarchical constraint-based representations that preserve direction relations (e.g., north, northeast) among the distinct components of complex spatial entities. They have been used in problems involving pattern matching and spatial information retrieval. In this paper we demonstrate how inference can be achieved in geographic databases of symbolic arrays using composition...

A Named Entity Recognizer for Danish

This paper describes how a preexisting Constraint Grammar based parser for Danish (DanGram, Bick 2002) has been adapted and semantically enhanced in order to accommodate for named entity recognition (NER), using rule based and lexical, rather than probabilistic methodology. The project is part of a multi-lingual Nordic initiative, Nomen Nescio, which targets 6 primary name types (human, organis...

Publication date: 2005