Fuzzy Joins in MapReduce: An Experimental Study
Abstract
We report experimental results for the MapReduce algorithms proposed by Afrati, Das Sarma, Menestrina, Parameswaran and Ullman in ICDE'12 to compute fuzzy joins of binary strings under Hamming distance. Their algorithms come with a complete theoretical analysis, but no experimental evaluation is provided. The authors argue that there is a tradeoff between communication cost and processing cost, and that the proposed algorithms form a skyline, i.e., none dominates another. We observe via experiments that, from a practical point of view, some algorithms are almost always preferable to others. We provide detailed experimental results and insights that show the different facets of each algorithm.

1. OBJECTIVES

In [1], several algorithms are proposed for performing a "fuzzy join" (an operation that finds pairs of similar items) in MapReduce. The main part of [1] concentrates on binary strings and Hamming distance; this offers the clearest view of the various algorithmic approaches. The algorithms proposed are: Naive, which compares every string in the set with every other; Ball-Hashing, a family of two algorithms that send strings to a "ball" of all "nearby" strings within a certain similarity; Anchor Points, a randomized algorithm that selects a set of anchor strings and compares any pair of strings that are close enough to a member of that set; and Splitting, an algorithm that splits the strings into pieces and compares only strings with matching pieces. Hamming Code is another algorithm that [1] considers; however, it is a special case of Anchor Points.

It is argued in [1] that there is a tradeoff between communication cost and processing cost, and that the proposed algorithms form a skyline, i.e., none dominates another. One of our objectives is to see whether this skyline can be observed in practical terms. We find via experiments that, from a practical point of view, some algorithms are almost always preferable to others: in our experiments, Splitting is a clear winner, whereas Ball-Hashing suffers for all distance thresholds except the very small ones. We provide detailed experiments and insights that show the different facets of each algorithm. Another objective is to provide implementation optimizations whenever possible. Specifically, we provide optimizations for Naive and Ball-Hashing, and clarify implementation details for the others.

2. ALGORITHMS AND IMPLEMENTATION

Naive algorithm. This algorithm sends a section of the input to every physical reducer. Each reducer checks every possible pair of strings (out of the set it received) to see whether they are within distance d of one another. More specifically, let K = J(J + 1)/2 be the number of reducers. They are keyed by (i, j), where 0 ≤ i ≤ j < J, thus forming a triangular matrix in which only one of reducer (i, j) and reducer (j, i) exists. Strings s ∈ S are hashed to values in [0, J). If a string s hashes to i, it is sent to reducer (i, j) or (j, i), whichever exists, for each j ∈ [0, J); thus each string is sent to exactly J reducers. Then, [1] suggests that each reducer exhaustively compare every possible pair of strings from the portion of S it received. This, however, does not need to be so.

Our optimization. Consider strings s and t that both hash to i. They are sent to reducers (i, i), (i, i+1), ..., (i, J−1), and each of these reducers will compare them for similarity. Clearly, only one of these reducers needs to compare s with t, say reducer (i, i); the others should not compare s with t, as that would be redundant. More formally, with our optimization a reducer (i, j) compares only strings that hash to i with strings that hash to j. This reduces the amount of work in the reducers by about 2/3, and it also eliminates duplicate output, one of the goals in [1]. The reduction of work by 2/3 is explained as follows. Let n_i and n_j be the numbers of strings hashing to i and j, respectively. An unoptimized reducer performs (n_i² + n_j² + n_i·n_j)/2 string comparisons, whereas an optimized one performs just n_i·n_j/2. The work saved is therefore (n_i² + n_j²)/2 comparisons, which is the fraction (n_i² + n_j²)/(n_i² + n_j² + n_i·n_j) ≈ 2/3 of the work if we assume n_i ≈ n_j in the average case (for n_i = n_j = n, the fraction is 2n²/3n² = 2/3).
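To make the bucketing and the optimized comparison pattern concrete, the following is a minimal Python sketch written for this report (it is not code from [1]); Python's built-in hash, the toy parameters J = 3 and d = 1, the example strings, and the in-memory dictionary standing in for the MapReduce shuffle are all assumptions of the example.

    from collections import defaultdict

    def hamming(s, t):
        # Hamming distance between two equal-length binary strings.
        return sum(a != b for a, b in zip(s, t))

    def naive_map(s, J):
        # A string hashing to bucket i is emitted once for each of the J reducer
        # keys that contain i; only the key with i <= j exists.
        i = hash(s) % J
        for j in range(J):
            yield (min(i, j), max(i, j)), (i, s)

    def naive_reduce(key, values, d):
        # Optimized reducer (i, j): compare bucket-i strings only with bucket-j strings.
        i, j = key
        left = [s for b, s in values if b == i]
        if i == j:
            # The diagonal reducer alone handles pairs that hash to the same bucket.
            for x in range(len(left)):
                for y in range(x + 1, len(left)):
                    if hamming(left[x], left[y]) <= d:
                        yield left[x], left[y]
        else:
            right = [s for b, s in values if b == j]
            for s in left:
                for t in right:
                    if hamming(s, t) <= d:
                        yield s, t

    # Toy driver: an in-memory stand-in for the shuffle phase.
    groups = defaultdict(list)
    for s in ["0101", "0111", "1101", "0100"]:
        for key, value in naive_map(s, J=3):
            groups[key].append(value)
    print({pair for key, vals in groups.items() for pair in naive_reduce(key, vals, d=1)})

In this sketch every similar pair is reported by exactly one reducer, which is the source of both the work reduction and the duplicate-free output discussed above.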
Ball-Hashing (BH). This is a family of two algorithms. For these algorithms there is one reducer for each possible string in the universe, so in practice the reducers are logical rather than physical. A reducer serving string s receives the input strings that lie within a ball of a certain radius around s.
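Both BH variants rely on enumerating the ball of all strings within Hamming distance d of a given string. The short Python sketch below shows one way to enumerate such a ball; the helper name hamming_ball and the example call are our own illustration, not an interface from [1].

    from itertools import combinations

    def hamming_ball(s, d):
        # All binary strings within Hamming distance d of s (including s itself).
        bits = list(s)
        ball = []
        for r in range(d + 1):
            for positions in combinations(range(len(bits)), r):
                t = bits[:]
                for p in positions:
                    t[p] = '1' if t[p] == '0' else '0'  # flip the chosen bit
                ball.append(''.join(t))
        return ball

    # Example: the radius-1 ball around a 4-bit string has 1 + 4 = 5 members.
    print(hamming_ball("0101", 1))  # ['0101', '1101', '0001', '0111', '0100']

A mapper in this spirit could emit (t, s) for every t in hamming_ball(s, d), so that the logical reducer keyed by t receives exactly the input strings within distance d of t. For b-bit strings the ball has sum over r = 0..d of C(b, r) members, which grows quickly with d and is consistent with our observation that Ball-Hashing suffers except at very small distance thresholds.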
Similar resources
A Theoretical and Experimental Comparison of Filter-Based Equijoins in MapReduce
MapReduce has become an increasingly popular framework for large-scale data processing. However, complex operations such as joins are quite expensive and require sophisticated techniques. In this paper, we review state-of-the-art strategies for joining several relations in a MapReduce environment and study their extension with filter-based approaches. The general objective of filters is to elim...
Efficient Large Outer Joins over MapReduce
Big Data analytics largely rely on being able to execute large joins efficiently. Though inner join approaches have been extensively evaluated in parallel and distributed systems, there is little published work providing analysis of outer joins, especially on the extremely popular MapReduce platform. In this paper, we studied several current algorithms/techniques used in large outer joins. We f...
Technical Report: MapReduce-based Similarity Joins
Cloud enabled systems have become a crucial component to efficiently process and analyze massive amounts of data. One of the key data processing and analysis operations is the Similarity Join, which retrieves all data pairs whose distances are smaller than a predefined threshold ε. Even though multiple algorithms and implementation techniques have been proposed for Similarity Joins, very little...
Efficient Processing Distributed Joins with Bloomfilter using MapReduce
The MapReduce framework has been widely used to process and analyze largescale datasets over large clusters. As an essential problem, join operation among large clusters attracts more and more attention in recent years due to the utilization of MapReduce. Many strategies have been proposed to improve the efficiency of distributed join, among which bloomfilter is a successful one. However, the b...
Scale reasoning with fuzzy-EL ontologies based on MapReduce
Fuzzy extension of Description Logics (DLs) allows the formal representation and handling of fuzzy or vague knowledge. In this paper, we consider the problem of reasoning with fuzzy-EL, which is a fuzzy extension of EL+. We first identify the challenges and present revised completion classification rules for fuzzy-EL that can be handled by MapReduce programs. We then propose an algorithm for sc...
Efficient and Scalable Graph Similarity Joins in MapReduce
Along with the emergence of massive graph-modeled data, it is of great importance to investigate graph similarity joins due to their wide applications for multiple purposes, including data cleaning, and near duplicate detection. This paper considers graph similarity joins with edit distance constraints, which return pairs of graphs such that their edit distances are no larger than a given thres...
Journal: PVLDB
Volume: 8, No. 12
Year of publication: 2015