Fuzzy Joins in MapReduce: An Experimental Study
Abstract
We report experimental results for the MapReduce algorithms proposed by Afrati, Das Sarma, Menestrina, Parameswaran and Ullman in ICDE'12 to compute fuzzy joins of binary strings under Hamming distance. Their algorithms come with a complete theoretical analysis, but no experimental evaluation is provided. The authors argue that there is a tradeoff between communication cost and processing cost, and that the proposed algorithms form a skyline, i.e., none dominates another. We observe via experiments that, from a practical point of view, some algorithms are almost always preferable to others. We provide detailed experimental results and insights that show the different facets of each algorithm.

1. OBJECTIVES

In [1], several algorithms are proposed for performing a "fuzzy join" (an operation that finds pairs of similar items) in MapReduce. The main part of [1] concentrates on binary strings and Hamming distance; this offers the clearest view of the various algorithmic approaches. The algorithms proposed are: Naive, which compares every string in the set with every other; Ball-Hashing, a family of two algorithms that send strings to a "ball" of all "nearby" strings within a certain similarity; Anchor Points, a randomized algorithm that selects a set of anchor strings and compares any pair of strings that are close enough to a member of that set; and Splitting, an algorithm that splits the strings into pieces and compares only strings with matching pieces. Hamming Code is another algorithm that [1] considers; however, it is a special case of Anchor Points.

It is argued in [1] that there is a tradeoff between communication cost and processing cost, and that the proposed algorithms form a skyline, i.e., none dominates another. One of our objectives is to see whether this skyline can be observed in practical terms. We find via experiments that, from a practical point of view, some algorithms are almost always preferable to others: in our experiments, Splitting is a clear winner, whereas Ball-Hashing suffers for all distance thresholds except the very small ones. We provide detailed experiments and insights that show the different facets of each algorithm. Another objective is to provide implementation optimizations whenever possible. Specifically, we provide optimizations for Naive and Ball-Hashing, and clarify implementation details for the others.

2. ALGORITHMS AND IMPLEMENTATION

Naive algorithm. This algorithm sends a section of the input to every physical reducer. Each reducer checks every possible pair of strings (out of the set it received) to see whether they are within distance d of one another. More specifically, let K = J(J + 1)/2 be the number of reducers. They are keyed by (i, j), where 0 ≤ i ≤ j < J, thus forming a triangular matrix in which only one of reducer (i, j) and reducer (j, i) exists. Strings s ∈ S are hashed to values in [0, J). If a string s hashes to i, it is sent to reducer (i, j) or (j, i), whichever exists, for each j ∈ [0, J); thus each string is sent to exactly J reducers. Then, [1] suggests that each reducer exhaustively compare every possible pair of strings from the portion of S it received. This, however, does not need to be so.

Our optimization. Consider strings s and t that both hash to i. They are sent to reducers (i, i), (i, i+1), ..., (i, J−1), and each of these reducers will compare them for similarity. Clearly, only one of these reducers needs to compare s with t, say reducer (i, i); the others should not compare s with t, as that would be redundant. More formally, with our optimization a reducer (i, j) compares only strings that hash to i with strings that hash to j. This reduces the amount of work in the reducers by about 2/3, and it also eliminates duplicate output, one of the goals in [1]. The reduction of work by 2/3 is explained as follows. Let n_i and n_j be the numbers of strings hashing to i and j, respectively. An unoptimized reducer performs (n_i² + n_j² + n_i·n_j)/2 string comparisons, whereas an optimized one performs just n_i·n_j/2. The work saved is therefore (n_i² + n_j²)/2 comparisons, which is the fraction (n_i² + n_j²)/(n_i² + n_j² + n_i·n_j) ≈ 2/3 of the work if we assume n_i ≈ n_j in the average case (for n_i = n_j = n, the fraction is 2n²/3n² = 2/3).
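To make the bucketing and the optimized comparison pattern concrete, the following is a minimal Python sketch written for this report (it is not code from [1]); Python's built-in hash, the toy parameters J = 3 and d = 1, the example strings, and the in-memory dictionary standing in for the MapReduce shuffle are all assumptions of the example.

    from collections import defaultdict

    def hamming(s, t):
        # Hamming distance between two equal-length binary strings.
        return sum(a != b for a, b in zip(s, t))

    def naive_map(s, J):
        # A string hashing to bucket i is emitted once for each of the J reducer
        # keys that contain i; only the key with i <= j exists.
        i = hash(s) % J
        for j in range(J):
            yield (min(i, j), max(i, j)), (i, s)

    def naive_reduce(key, values, d):
        # Optimized reducer (i, j): compare bucket-i strings only with bucket-j strings.
        i, j = key
        left = [s for b, s in values if b == i]
        if i == j:
            # The diagonal reducer alone handles pairs that hash to the same bucket.
            for x in range(len(left)):
                for y in range(x + 1, len(left)):
                    if hamming(left[x], left[y]) <= d:
                        yield left[x], left[y]
        else:
            right = [s for b, s in values if b == j]
            for s in left:
                for t in right:
                    if hamming(s, t) <= d:
                        yield s, t

    # Toy driver: an in-memory stand-in for the shuffle phase.
    groups = defaultdict(list)
    for s in ["0101", "0111", "1101", "0100"]:
        for key, value in naive_map(s, J=3):
            groups[key].append(value)
    print({pair for key, vals in groups.items() for pair in naive_reduce(key, vals, d=1)})

In this sketch every similar pair is reported by exactly one reducer, which is the source of both the work reduction and the duplicate-free output discussed above.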
Ball-Hashing (BH). This is a family of two algorithms. For these algorithms there is one reducer for each possible string in the universe, so in practice the reducers are logical rather than physical. A reducer serving string s receives the input strings that lie within a ball of a certain radius around s.
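Both BH variants rely on enumerating the ball of all strings within Hamming distance d of a given string. The short Python sketch below shows one way to enumerate such a ball; the helper name hamming_ball and the example call are our own illustration, not an interface from [1].

    from itertools import combinations

    def hamming_ball(s, d):
        # All binary strings within Hamming distance d of s (including s itself).
        bits = list(s)
        ball = []
        for r in range(d + 1):
            for positions in combinations(range(len(bits)), r):
                t = bits[:]
                for p in positions:
                    t[p] = '1' if t[p] == '0' else '0'  # flip the chosen bit
                ball.append(''.join(t))
        return ball

    # Example: the radius-1 ball around a 4-bit string has 1 + 4 = 5 members.
    print(hamming_ball("0101", 1))  # ['0101', '1101', '0001', '0111', '0100']

A mapper in this spirit could emit (t, s) for every t in hamming_ball(s, d), so that the logical reducer keyed by t receives exactly the input strings within distance d of t. For b-bit strings the ball has sum over r = 0..d of C(b, r) members, which grows quickly with d and is consistent with our observation that Ball-Hashing suffers except at very small distance thresholds.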
Similar resources
A Theoretical and Experimental Comparison of Filter-Based Equijoins in MapReduce
MapReduce has become an increasingly popular framework for large-scale data processing. However, complex operations such as joins are quite expensive and require sophisticated techniques. In this paper, we review state-of-the-art strategies for joining several relations in a MapReduce environment and study their extension with filter-based approaches. The general objective of filters is to elim...
Efficient Large Outer Joins over MapReduce
Big Data analytics largely rely on being able to execute large joins efficiently. Though inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work providing analysis of outer joins, especially on the extremely popular MapReduce platform. In this paper, we studied several current algorithms/techniques used in large outer joins. We f...
Technical Report: MapReduce-based Similarity Joins
Cloud enabled systems have become a crucial component to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a predefined threshold ε. Even though multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little...
Efficient Processing Distributed Joins with Bloomfilter using MapReduce
The MapReduce framework has been widely used to process and analyze largescale datasets over large clusters. As an essential problem, join operation among large clusters attracts more and more attention in recent years due to the utilization of MapReduce. Many strategies have been proposed to improve the efficiency of distributed join, among which bloomfilter is a successful one. However, the b...
Scale reasoning with fuzzy-EL ontologies based on MapReduce
Fuzzy extension of Description Logics (DLs) allows the formal representation and handling of fuzzy or vague knowledge. In this paper, we consider the problem of reasoning with fuzzy-EL, which is a fuzzy extension of EL+. We first identify the challenges and present revised completion classification rules for fuzzy-EL that can be handled by MapReduce programs. We then propose an algorithm for sc...
Efficient and Scalable Graph Similarity Joins in MapReduce
Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning, and near duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given thres...
Journal: PVLDB
Volume: 8, No. 12
Year of publication: 2015