Training Data Selection for Record Linkage Classification
Abstract
This paper presents a new two-step approach for record linkage, focusing on the creation of high-quality training data in the first step. The approach employs an unsupervised random forest model as a similarity measure to produce a similarity score vector for record matching. Three constructions were proposed to select non-match pairs for the training data, with both balanced (symmetry) and imbalanced (asymmetry) distributions tested. The top construction was found to be the most effective, producing training data with 100% correct labels. Random forest and support vector machine classification algorithms were compared, producing an F1-score comparable to that of probabilistic record linkage using the expectation maximisation algorithm and EpiLink. On average, random forests improved recall by 1% and [...] by 6.45% compared to existing methods. By emphasising the creation of high-quality training data, this approach has the potential to improve the accuracy and efficiency of record linkage for a wide range of applications.
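The first step described in the abstract, picking confidently labelled pairs out of a pool of scored candidate pairs, can be illustrated with a minimal sketch. The abstract does not spell out the three constructions, so the selection rule below (highest-scoring pairs as matches, lowest-scoring pairs as non-matches) is an assumption for illustration only, not the paper's method. The names `select_training_pairs`, `non_match_ratio`, and `precision_recall_f1` are hypothetical, and the scores are made-up numbers standing in for the output of a similarity model.

```python
from typing import List, Sequence, Tuple


def select_training_pairs(
    scores: Sequence[float],
    n_match: int,
    non_match_ratio: float = 1.0,
) -> Tuple[List[int], List[int]]:
    """Pick seed training pairs from per-pair similarity scores.

    Assumed rule (not taken from the paper): the n_match highest-scoring
    pairs are labelled matches and the lowest-scoring pairs are labelled
    non-matches. non_match_ratio=1.0 gives a balanced set; a value above
    1.0 gives an imbalanced set with more non-matches than matches.
    """
    if n_match <= 0:
        raise ValueError("n_match must be positive")
    n_non = int(round(n_match * non_match_ratio))
    if n_match + n_non > len(scores):
        raise ValueError("not enough candidate pairs")
    # Indices of candidate pairs, sorted from most to least similar.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    matches = order[:n_match]
    non_matches = order[len(order) - n_non:] if n_non else []
    return matches, non_matches


def precision_recall_f1(
    predicted: Sequence[bool], actual: Sequence[bool]
) -> Tuple[float, float, float]:
    """Compute precision, recall, and F1 for predicted match labels."""
    tp = sum(1 for p, a in zip(predicted, actual) if p and a)
    fp = sum(1 for p, a in zip(predicted, actual) if p and not a)
    fn = sum(1 for p, a in zip(predicted, actual) if not p and a)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Illustrative scores for eight candidate record pairs (invented values).
scores = [0.95, 0.10, 0.88, 0.05, 0.40, 0.02, 0.91, 0.15]

# Imbalanced selection: 2 matches and 3 non-matches.
matches, non_matches = select_training_pairs(scores, n_match=2, non_match_ratio=1.5)
print(matches, non_matches)
```

The selected pairs would then serve as labelled training data for a supervised classifier in the second step; the evaluation helper shows how recall and F1-score, the measures quoted in the abstract, are derived from predicted and true match labels.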
Similar resources
Automatic Training Example Selection for Scalable Unsupervised Record Linkage
Linking records from two or more databases is becoming increasingly important in the data preparation step of many data mining projects, as linked data can enable analysts to conduct studies that are not feasible otherwise, or that would require expensive and time-consuming collection of specific data. The aim of such linkages is to match all records that refer to the same entity. One of the mai...
Probabilistic Linkage of Persian Record with Missing Data
Extended Abstract. When comprehensive information about a topic is scattered across two or more data sets, using only one of them loses the information available in the others. Hence, it is necessary to integrate the scattered information into a single comprehensive data set. On the other hand, sometimes we are interested in recognizing duplicates within a data set. The i...
Improved record linkage for encrypted identifying data
The health data integration project at the E-Health Research Centre is researching ways of improving the integration of health and health related data while maintaining the privacy and security of the data. One such method is to improve the mechanisms of matching patients across databases when the identifying information must not be revealed, even during the linkage step. Background: With healt...
Data Fusion with Record Linkage
Assume that there are two sources (e.g. files) consisting of records with different information about some units, such as people. We want to fuse the information (data) that belongs to the same units. Very often in practice no identification numbers, like the Social Security Number (SSN), are available in both files, which is why there is some uncertainty about which records belong together. Anyway, we...
Improving Temporal Record Linkage Using Regression Classification
Temporal record linkage is the process of identifying groups of records that are collected over a period of time, such as in census or voter registration databases, where records in the same group represent the same real-world entity. Such databases often contain temporal information, such as the time when a record was created or when it was modified. Unlike traditional record linkage, which co...
Journal
Journal title: Symmetry
Year: 2023
ISSN: 0865-4824, 2226-1877
DOI: https://doi.org/10.3390/sym15051060