A New Method for Duplicate Detection Using Hierarchical Clustering of Records


Abstract:

Accurate and valid data are prerequisites for the correct operation of any software system. Errors can always occur in data because of human and system faults. One such error is the existence of duplicate records in data sources. Duplicate records refer to the same real-world entity. A data source should contain only one record per entity, but for reasons such as the aggregation of data sources and human faults in data entry, several copies of an entity can appear in one data source. This problem leads to errors in the operations or output of a system, and it is costly for the organization or business concerned. Data cleaning, and duplicate record detection in particular, has therefore become one of the most important areas of computer science in recent years. Many solutions have been presented for detecting duplicates in different situations, but almost all of them are time-consuming. In addition, the volume of data grows every day, so previous methods no longer perform well enough. Incorrectly detecting two different records as duplicates is another problem that recent work faces. This matters because duplicates are usually deleted, so some correct data will be lost. New methods therefore appear to be necessary. In this paper, a method is proposed that reduces the required volume of processing by using hierarchical clustering with appropriate features. In this method, the similarity between records is estimated at several levels, and each level uses a different feature to estimate that similarity. As a result, the last level produces clusters that contain very similar records, and comparisons for detecting duplicates are carried out only on these records. The paper also proposes a relative similarity function for comparing records, which determines similarity with high precision. Finally, the evaluation results show that the proposed method detects 90% of duplicate records with 97% accuracy in less time, an improvement over previous results.
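
The multi-level idea in the abstract can be illustrated with a minimal Python sketch. This is not the paper's algorithm: the abstract does not name the features used at each level or define the relative similarity function, so the features (first letter of the name, normalized city), the record fields, the threshold, and the use of `difflib.SequenceMatcher` as a stand-in similarity measure are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher
from itertools import combinations

# Hypothetical per-level features (assumed, not taken from the paper).
# Each one maps a record to a coarse key; one feature is applied per level.
LEVEL_FEATURES = [
    lambda r: r["name"][:1].lower(),      # level 1: first letter of the name
    lambda r: r["city"].strip().lower(),  # level 2: normalized city
]

def hierarchical_clusters(records, features):
    """Refine clusters level by level, using a different feature at each level."""
    clusters = [list(records)]
    for feature in features:
        refined = []
        for cluster in clusters:
            groups = {}
            for rec in cluster:
                groups.setdefault(feature(rec), []).append(rec)
            refined.extend(groups.values())
        clusters = refined
    return clusters

def similarity(a, b, fields=("name", "city", "phone")):
    """Mean per-field string similarity in [0, 1].

    A stand-in for the paper's relative similarity function, which the
    abstract does not define.
    """
    scores = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)

def find_duplicates(records, threshold=0.85):
    """Compare records pairwise, but only inside the last-level clusters."""
    pairs = []
    for cluster in hierarchical_clusters(records, LEVEL_FEATURES):
        for a, b in combinations(cluster, 2):
            if similarity(a, b) >= threshold:
                pairs.append((a["id"], b["id"]))
    return pairs

# Made-up sample records for demonstration only.
records = [
    {"id": 1, "name": "John Smith", "city": "Tehran", "phone": "555-0101"},
    {"id": 2, "name": "Jon Smith", "city": "tehran", "phone": "555-0101"},
    {"id": 3, "name": "Jane Doe", "city": "Tehran", "phone": "555-0199"},
    {"id": 4, "name": "Sara Lee", "city": "Shiraz", "phone": "555-0123"},
]

if __name__ == "__main__":
    print(find_duplicates(records))
```

Because pairwise comparison is restricted to the final clusters, the number of comparisons falls well below the all-pairs count when clusters are small. The trade-off in a sketch like this is that a duplicate whose clustering feature is itself corrupted (for example, a typo in the first letter of the name) lands in a different cluster and is never compared.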


Similar resources

Duplicate Detection of Records in Queries Using Clustering

The problem of detecting and eliminating duplicated data is one of the major problems in the broad area of data cleaning and data quality in data warehouses. Often, the same logical real-world entity has multiple representations in the data warehouse. Duplicate elimination is hard because it is caused by several types of errors like typographical errors, and different representations o...


Automatic Road Detection and Extraction From MultiSpectral Images Using a New Hierarchical Object-based Method

Road detection and extraction is one of the most important issues in photogrammetry, remote sensing and machine vision. A great deal of research in this area has been based on multispectral images, mostly with relatively good results. In this paper, a novel automated, hierarchical object-based method for detecting and extracting roads is proposed. This research is based on the M...


A new method for hierarchical clustering combination

In the field of pattern recognition, combining different classifiers into a robust classifier is a common approach for improving classification accuracy. Recently, this approach has also been used to improve clustering performance, especially in non-hierarchical clustering. Generally, hierarchical clustering is preferred over partitional clustering for applications when ...


A Hierarchical Classification Method for Breast Tumor Detection

Introduction: Breast cancer is the second cause of mortality among women. Early detection can improve the chance of survival. Screening systems such as mammography cannot perfectly differentiate between patients and healthy individuals. Computer-aided diagnosis can help physicians make a more accurate diagnosis. Materials and Methods: Regarding the importance of separating normal and abnorm...


A New Hierarchical Clustering Method using Topological Map

We present a new hierarchical clustering criterion which can be applied to a data set. This is done after generating an initial partition by using a Topological Self Organizing Map. This criterion contains two terms which take into account two different errors simultaneously: the square error of the entire clustering (as in the Ward criterion) and the topological structure given by the Self Organizing M...


A New Sensitive Method for Detection of Viroids

Background and Aims: Viroids are the smallest known plant pathogens and cause several economically significant diseases. Until recently, viroid detection relied mainly on biological tests and indexing. Today, various diagnostic techniques such as nucleic acid hybridization, Southern blot and reverse transcription coupled with polymerase chain reaction (RT-PCR) are being used for detection and diag...




Volume 18, Issue 4

Pages 3-22

Publication date: 2022-03


