WANalytics: Analytics for a Geo-Distributed Data-Intensive World
نویسندگان
چکیده
Large organizations today operate data centers around the globe where massive amounts of data are produced and consumed by local users. Despite their geographically diverse origin, such data must be analyzed/mined as a whole. We call the problem of supporting rich DAGs of computation across geographically distributed data Wide-Area Big-Data (WABD). To the best of our knowledge, WABD is not supported by currently deployed systems nor sufficiently studied in literature; it is addressed today by continuously copying raw data to a central location for analysis. We observe from production workloads that WABD is important for large organizations, and that centralized solutions incur substantial cross-data center network costs. We argue that these trends will only worsen as the gap between data volumes and transoceanic bandwidth widens. Further, emerging concerns over data sovereignty and privacy may trigger government regulations that can threaten the very viability of centralized solutions. To address WABD we propose WANalytics, a system that pushes computation to edge data centers, automatically optimizing workflow execution plans and replicating data when needed. Our Hadoop-based prototype delivers 257× reduction in WAN bandwidth on a production workload from Microsoft. We round out our evaluation by also demonstrating substantial gains for three standard benchmarks: TPC-CH, Berkeley Big Data, and BigBench.
منابع مشابه
CLARINET: WAN-Aware Optimization for Analytics Queries
Recent work has made the case for geo-distributed analytics, where data collected and stored at multiple datacenters and edge sites world-wide is analyzed in situ to drive operational and management decisions. A key issue in such systems is ensuring low response times for analytics queries issued against geo-distributed data. A central determinant of response time is the query execution plan (Q...
متن کاملTowards Reliable (and Efficient) Job Executions in a Practical Geo-distributed Data Analytics System
Geo-distributed data analytics are increasingly common to derive useful information in large organisations. Naive extension of existing cluster-scale data analytics systems to the scale of geo-distributed data centers faces unique challenges including WAN bandwidth limits, regulatory constraints, changeable/unreliable runtime environment, and monetary costs. Our goal in this work is to develop ...
متن کاملLow Latency Geo-distributed Data Analytics – Public Review
Large cloud service providers ingest massive amounts of data in geographically distributed sites spread across the globe. Analytics for such planetary-scale datasets is an important emerging challenge. The current practice is to copy all data to a central location, where it can be dealt with locally by standard data analytics stacks such as Hadoop and Spark. However, transferring large volumes ...
متن کاملEnergy-efficient Analytics for Geographically Distributed Big Data
Big data analytics on geographically distributed datasets (across data centers or clusters) has been attracting increasing interests in both academia and industry, posing significant complications for system and algorithm design. In this article, we systematically investigate the geo-distributed big-data analytics framework by analyzing the fine-grained paradigm and the key design principles. W...
متن کاملCreating a Portable, High-Level Graph Analytics Paradigm For Compute and Data-Intensive Applications
HPC offers tremendous potential to process large amount of data commonly referred to as ‘Big Data’. Due to the immense computational requirements of Big Data applications, the HPC and Big Data communities are converging. As a result, heterogeneous and distributed systems are becoming commonplace. In order to take advantage of the immense computing power of these systems, distributing data effic...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
دوره شماره
صفحات -
تاریخ انتشار 2015