Dedoop: Efficient Deduplication with Hadoop
Authors
Abstract
We demonstrate a powerful and easy-to-use tool called Dedoop (Deduplication with Hadoop) for MapReduce-based entity resolution (ER) of large datasets. Dedoop supports a browser-based specification of complex ER workflows, including blocking and matching steps as well as the optional use of machine learning for the automatic generation of match classifiers. Specified workflows are automatically translated into MapReduce jobs for parallel execution on different Hadoop clusters. To achieve high performance, Dedoop supports several advanced load-balancing strategies.
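As a rough illustration of the kind of job such a workflow compiles down to, the sketch below shows a single Hadoop MapReduce job that groups records by a blocking key in the map phase and compares record pairs only within each block in the reduce phase. This is a minimal sketch, not Dedoop's actual generated code: the CSV input layout, the three-character name-prefix blocking key, and the fixed similarity threshold are all assumptions made for illustration.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class BlockingMatchJob {

  // Map: emit (blockingKey, record). The blocking key here is simply the
  // first three characters of an assumed "id,name,..." record's name field;
  // Dedoop lets users configure blocking in its web UI instead.
  public static class BlockingMapper extends Mapper<Object, Text, Text, Text> {
    @Override
    protected void map(Object key, Text value, Context ctx)
        throws IOException, InterruptedException {
      String[] fields = value.toString().split(",");
      String name = fields[1].trim().toLowerCase();
      String blockingKey = name.substring(0, Math.min(3, name.length()));
      ctx.write(new Text(blockingKey), value);
    }
  }

  // Reduce: compare all record pairs within one block and emit matches.
  public static class MatchReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text key, Iterable<Text> values, Context ctx)
        throws IOException, InterruptedException {
      List<String> block = new ArrayList<>();
      for (Text v : values) block.add(v.toString());
      for (int i = 0; i < block.size(); i++) {
        for (int j = i + 1; j < block.size(); j++) {
          if (similarity(block.get(i), block.get(j)) > 0.8) {
            ctx.write(new Text(block.get(i)), new Text(block.get(j)));
          }
        }
      }
    }

    // Placeholder match function; Dedoop supports configurable matchers and
    // learned classifiers rather than a fixed rule like this.
    private static double similarity(String a, String b) {
      return a.equalsIgnoreCase(b) ? 1.0 : 0.0;
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "blocking-match-sketch");
    job.setJarByClass(BlockingMatchJob.class);
    job.setMapperClass(BlockingMapper.class);
    job.setReducerClass(MatchReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}

Note that the reducer's pairwise comparison loop is quadratic in the block size, which is why skewed blocking keys produce straggler reducers and why advanced load-balancing strategies like Dedoop's matter in practice.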
Similar resources
Datenbank-Spektrum, Schwerpunkt MapReduce (focus-issue contents): MapReduce Programming Model; Compilation of Query Languages into MapReduce; Efficient OR Hadoop: Why Not Both?; Parallel Entity Resolution with Dedoop; Inkrementelle Neuberechnungen in MapReduce (Incremental Recomputations in MapReduce); Fachbeitrag (technical contribution): Towards Integrated Data Analytics: Time Series Forecasting in DBMS
Published/indexed in Google Scholar, Academic OneFile, DBLP, io-port.net, OCLC, and Summon by Serial Solutions. Author guidelines for the journal Datenbank Spektrum are available at www.springer.com/13222. Datenbank Spektrum (2013) 13:1–3, DOI 10.1007/s13222-013-0116-z
Efficient Cross User Client Side Data Deduplication in Hadoop
Hadoop is widely used for applications such as Aadhaar card processing, healthcare, media, ad platforms, fraud and crime detection, and education. However, it does not provide an efficient, optimized data-storage solution. One interesting finding is that when a user uploads the same file twice under the same file name, Hadoop does not allow the duplicate to be saved. But when the user uploads the same file content with a diffe...
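The behavior hinted at above, recognizing identical content even under different file names, is typically achieved by hashing file contents rather than comparing names. The following is a generic, client-side sketch of that idea, not the paper's actual cross-user scheme: the in-memory HashSet stands in for whatever shared digest index a real system would maintain.

import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashSet;
import java.util.Set;

public class ContentDedupCheck {
  // Digests of already-stored files. In a real deployment this index would
  // live server-side so uploads from different users share it.
  private final Set<String> storedDigests = new HashSet<>();

  // Returns true if the file's content is new and should be uploaded.
  public boolean shouldUpload(Path file)
      throws IOException, NoSuchAlgorithmException {
    MessageDigest md = MessageDigest.getInstance("SHA-256");
    byte[] digest = md.digest(Files.readAllBytes(file));
    StringBuilder hex = new StringBuilder();
    for (byte b : digest) hex.append(String.format("%02x", b));
    // add() returns false if the digest was already present, i.e. the same
    // content was stored before, regardless of the file name.
    return storedDigests.add(hex.toString());
  }
}

In a cross-user deployment the digest index would be kept server-side (e.g., alongside HDFS metadata) so that uploads from different users can be deduplicated against each other.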
A novel approach to data deduplication over the engineering-oriented cloud systems
This paper presents a duplication-less storage system for engineering-oriented cloud computing platforms. Our deduplication storage system, which manages data and duplication over the cloud system, consists of two major components: a front-end deduplication application and a mass storage system as the backend. The Hadoop distributed file system (HDFS) is a common distributed file system on the cl...
Estimation of Secure Data Deduplication in Big Data
Big data is associated with very large, complex data sets. In a big-data environment, data is unstructured and may contain many duplicate copies of the same data. Hadoop, an open-source platform specially designed for big-data environments, is used to manage such complex unstructured data. Hadoop can handle unstructured data very efficiently compared to t...
DduP - Towards a Deduplication Framework Utilising Apache Spark
This paper is about a new framework called DeduPlication (DduP). DduP aims to solve large-scale deduplication problems on arbitrary data tuples and tries to bridge the gap between big data, high performance, and duplicate detection. At the moment a first prototype exists, but the overall project is work in progress. DduP utilises the promising successor of Apache Hadoop MapReduce [Had14]...
Journal: PVLDB
Volume: 5, Issue: -
Pages: -
Publication date: 2012