An Investigation of Linkage in Nominal Data

Author

  • Chakkrit Snae
Abstract

The Internet provides access to vast volumes of nominal data (data containing names) collected for a range of different purposes (e.g. parish registers containing baptism, marriage, and burial records). To mine these data effectively, methods must exist that are aware of both the source and the semantics of the data, as well as the types of linkage (relationships between records) that can exist. In addition to handling the implicit constraints of nominal data, such a system must also be able to handle automatically a range of temporal and spatial rules and constraints. This paper describes some initial research into this problem. It describes a prototype nominal data workbench that allows the specification and examination of several linkage types, and it discusses the merits of alternative name matching methods. The paper concludes that, in the cases examined so far, effective nominal data linkage is essentially a query optimization process. The process is made more efficient if linkage-specific indexes exist, and the paper suggests that query re-organization based on these indexes, though a complex process, is entirely feasible. To facilitate the use of indexes and to guide the optimization process, the work suggests the use of formal ontologies.

Keywords: nominal data; record linkage; linkage workbench; domain ontologies; name matching; data mining

1. A Workbench Approach

Access to vast volumes of nominal data [6] (data associated with names, e.g. birth/death records, parish records, census data, text articles, newspapers, multimedia, etc.) is now readily available through the Internet. Mining these data resources involves linkage [33] (how two names are related, e.g. similar surname, same spatio-temporal location, legal association, etc.). This paper presents the results of the first year of a three-year Ph.D. study into automating various aspects of nominal data linkage.
In particular, this paper is concerned with algorithms associated with matching, one of the key elements of nominal data linkage. The human approach to nominal data linkage is to acquire a corpus of knowledge, which is then used to compare entire records and to rank matches based on experience. Current computer-based techniques use a more or less complicated matching algorithm to search a database (or databases) for linkage and then to rank the linkage probabilities. All prior knowledge is captured in the program logic, and the program is usually optimized for speed or accuracy. Current systems are unable:

  • to modify their prior knowledge dynamically,
  • to choose arbitrarily which datasets to search,
  • to vary the matching algorithms used and to change their control parameters dynamically,
  • to combine the results from disparate nominal data linkage / matching runs.

What is required is a workbench approach with components that may be tailored for different situations and parameterized by the user to meet changing requirements. The goal of this research is to investigate the elements of such a workbench and to evaluate prototypes using the extensive Lancashire and Cheshire Parish Register [14] archive held on the MIMAS database computer located at Manchester University.

Capturing and processing prior knowledge in the workbench approach is dependent on combining
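The workbench idea described above, with interchangeable matching algorithms, user-set parameters, and combined results, might be sketched roughly as follows. This is only an illustration under stated assumptions: the text shown here does not say which matching methods the prototype uses, so Soundex and edit distance are stand-ins chosen because they are widely known name matching techniques, and every name in the code (`soundex`, `edit_similarity`, `rank_candidates`, the weights) is hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# A matcher takes two names and returns a similarity score in [0, 1].
Matcher = Callable[[str, str], float]

# Letter-to-digit table for a simplified American Soundex (an assumed stand-in).
_CODES = {ch: digit
          for digit, letters in {"1": "BFPV", "2": "CGJKQSXZ", "3": "DT",
                                 "4": "L", "5": "MN", "6": "R"}.items()
          for ch in letters}


def soundex(name: str) -> str:
    """Return a four-character Soundex code, or "" if the name has no letters."""
    letters = [c for c in name.upper() if c.isalpha()]
    if not letters:
        return ""
    out = letters[0]
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = _CODES.get(ch, "")
        if code and code != prev:
            out += code
        if ch not in "HW":  # H and W do not separate letters with the same code
            prev = code
    return (out + "000")[:4]


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


def exact_match(a: str, b: str) -> float:
    return 1.0 if a.lower() == b.lower() else 0.0


def soundex_match(a: str, b: str) -> float:
    code = soundex(a)
    return 1.0 if code and code == soundex(b) else 0.0


def edit_similarity(a: str, b: str) -> float:
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(a.lower(), b.lower()) / longest


def rank_candidates(query: str, candidates: List[str],
                    matchers: Dict[str, Tuple[Matcher, float]]) -> List[Tuple[str, float]]:
    """Score each candidate with a weighted mean of the matchers, best first."""
    total = sum(weight for _, weight in matchers.values())
    scored = [(name,
               sum(weight * fn(query, name) for fn, weight in matchers.values()) / total)
              for name in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# The user chooses the matchers and their weights (hypothetical values).
matchers = {"exact": (exact_match, 1.0),
            "soundex": (soundex_match, 1.0),
            "edit": (edit_similarity, 1.0)}
ranking = rank_candidates("Smith", ["Smyth", "Jones", "Smith"], matchers)
```

Because the matchers are passed in as data, a user could drop one, add another, or change a weight without touching the ranking code, which is the kind of flexibility the list of limitations above asks for.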


Similar resources

Probabilistic Linkage of Persian Record with Missing Data

Extended Abstract. When comprehensive information about a topic is scattered among two or more data sets, using only one of those data sets loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in recognition of duplications in a data set. The i...

Bayesian Linkage Analysis of Categorical Traits for Arbitrary Pedigree Designs

BACKGROUND Pedigree studies of complex heritable diseases often feature nominal or ordinal phenotypic measurements and missing genetic marker or phenotype data. METHODOLOGY We have developed a Bayesian method for Linkage analysis of Ordinal and Categorical traits (LOCate) that can analyze complex genealogical structure for family groups and incorporate missing data. LOCate uses a Gibbs sampli...

Transition Models for Analyzing Longitudinal Data with Bivariate Mixed Ordinal and Nominal Responses

In many longitudinal studies, nominal and ordinal mixed bivariate responses are measured. In these studies, the aim is to investigate the effects of explanatory variables on these time-related responses. A regression analysis for these types of data must allow for the correlation among responses over time. To analyze such ordinal-nominal responses, using a proposed weighting approach, an ...

Analysis of Urban-Rural Interrelations in the Development of the Villages of Homeh Rural District, Harsin County

There has recently been a strong paradigm shift in the literature on how rural development and urban development affect one another. The conventional wisdom of the last three decades suggests that urban and rural developments are separate and compete with each other for resources. However, a closer look reveals that this is far from the truth. Relationship and range between urban-rural have many va...

Distance-based and probabilistic record linkage for re-identification of records with categorical variables

Record linkage methods are methods for identifying the presence of the same individual in different data files (re-identification). This paper studies and compares the two main existing approaches for record linkage: probabilistic and distance-based. The performance of both approaches is compared when data are categorical. To that end, a distance over ordinal and nominal scales is defined. The ...

Publication date: 2001