Sorted Nearest Neighborhood Clustering for Efficient Private Blocking

Authors

  • Dinusha Vatsalan
  • Peter Christen

Abstract

Record linkage is an emerging research area, required by various real-world applications, that identifies which records in different data sources refer to the same real-world entities. Privacy concerns and restrictions often prevent the use of traditional record linkage applications across different organizations. Linking records in situations where no private or confidential information can be revealed is known as privacy-preserving record linkage (PPRL). As with traditional record linkage applications, scalability is a major challenge in PPRL. This challenge is generally addressed by employing a blocking technique, which reduces the number of candidate record pairs by removing pairs that are likely non-matches without comparing them in detail. This paper presents an efficient private blocking technique based on a sorted neighborhood approach that combines k-anonymous clustering with the use of public reference values. An empirical study conducted on real-world databases shows that this approach scales to large databases and that it can provide effective blocking while preserving k-anonymous characteristics. The proposed approach can be up to two orders of magnitude faster than two state-of-the-art private blocking techniques: k-nearest neighbor clustering and Hamming-based locality sensitive hashing.
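To make the blocking idea concrete, the following is a minimal sketch of plain (non-private) sorted neighborhood blocking. It is not the private technique proposed in the paper: it omits k-anonymous clustering and public reference values entirely. The function name `sorted_neighborhood_pairs`, the record layout, and the `window` parameter are assumptions chosen for illustration only.

```python
def sorted_neighborhood_pairs(records, sort_key, window=3):
    """Return candidate pairs of record ids via a sliding window.

    Illustrative sketch only: records are sorted by a blocking key, and
    each record is paired with the next (window - 1) records in that
    order. Pairs that never share a window are never compared.
    """
    if window < 2:
        raise ValueError("window must be at least 2")
    ordered = sorted(records, key=sort_key)
    pairs = set()
    for i, left in enumerate(ordered):
        # Only the next (window - 1) neighbors become candidates.
        for right in ordered[i + 1 : i + window]:
            pairs.add(tuple(sorted((left["id"], right["id"]))))
    return pairs


# Hypothetical toy records; the surname serves as the sorting key.
records = [
    {"id": 1, "surname": "smith"},
    {"id": 2, "surname": "smyth"},
    {"id": 3, "surname": "miller"},
    {"id": 4, "surname": "millar"},
    {"id": 5, "surname": "jones"},
]

candidates = sorted_neighborhood_pairs(records, lambda r: r["surname"], window=2)
print(sorted(candidates))
```

With five records, a full comparison would examine 10 pairs; with a window of 2 the sketch keeps only 4 candidate pairs, among them the similar surnames (smith/smyth and millar/miller) that sort next to each other. A larger window trades more comparisons for a lower chance of missing true matches.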

Similar resources

Parallel Sorted Neighborhood Blocking with MapReduce

Cloud infrastructures enable the efficient parallel execution of data-intensive tasks such as entity resolution on large datasets. We investigate challenges and possible solutions of using the MapReduce programming model for parallel entity resolution. In particular, we propose and evaluate two MapReduce-based implementations for Sorted Neighborhood blocking that either use multiple MapReduce j...

A Nearest Neighbor Method for Efficient ICP

A novel solution is presented to the Nearest Neighbor Problem that is specifically tailored for determining correspondences within the Iterative Closest Point Algorithm. The reference point set P is preprocessed by calculating for each point p_i ∈ P that neighborhood of points which lie within a certain distance ε of p_i. The points within each ε-neighborhood are sorted by increasing distance to t...

On the Complexity of Sorted Neighborhood

Record linkage concerns identifying semantically equivalent records in databases. Blocking methods are employed to avoid the cost of full pairwise similarity comparisons on n records. In a seminal work, Hernández and Stolfo proposed the Sorted Neighborhood blocking method. Several empirical variants have been proposed in recent years. In this paper, we investigate the complexity of the Sorted N...

Sorted Neighborhood for Schema-free RDF Data

Entity Resolution (ER) concerns identifying pairs of entities that refer to the same underlying entity. To avoid O(n²) pairwise comparison of n entities, blocking methods are used. Sorted Neighborhood is an established blocking method for Relational Databases. It has not been applied to schema-free Resource Description Framework (RDF) data sources widely prevalent in the Linked Data ecosystem. T...

N-Way Heterogeneous Blocking

Record linkage concerns the linkage of records between two tabular datasets. To avoid naive quadratic computation, typical solutions employ a technique called blocking. A blocking scheme partitions records into blocks, and generates a candidate set by pairing records within a block. Current models of blocking have been restricted to two homogeneous datasets. The variety aspect of Big Data motiv...


Publication date: 2013