Interacting with Large Distributed Datasets Using Sketch

Authors

  • Mihai Budiu
  • Rebecca Isaacs
  • Derek Murray
  • Gordon Plotkin
  • Paul Barham
  • Samer Al-Kiswany
  • Yazan Boshmaf
  • Qingzhou Luo
  • Alexandr Andoni
Abstract

We present Sketch, a library and distributed runtime for building interactive tools for exploring large datasets that are distributed across multiple machines. We have built several sophisticated applications using this framework; in this paper we describe two of them: a billion-row spreadsheet and a distributed-systems performance analyzer. Sketch applications allow interactive and responsive exploration of complex distributed datasets, scaling effectively to take advantage of large computational resources.


Related resources

Bias-Aware Sketches

Count-Sketch [6] and Count-Median [11] are two widely used sketching algorithms for processing large-scale distributed and streaming datasets, with applications such as finding frequent elements, computing frequency moments, and performing point queries. The errors of Count-Sketch and Count-Median are expressed in terms of the sum of coordinates of the input vector excluding the largest ones, or, the mass on ...


Scaling Graph-based Semi Supervised Learning to Large Number of Labels Using Count-Min Sketch

Graph-based semi-supervised learning (SSL) algorithms have been successfully used in a large number of applications. These methods classify initially unlabeled nodes by propagating label information over the structure of the graph, starting from seed nodes. Graph-based SSL algorithms usually scale linearly with the number of distinct labels (m), and require O(m) space on each node. Unfortunately, th...


Classification of Photo and Sketch Images Using Convolutional Neural Networks

In this study we propose a Convolutional Neural Network (CNN) that can classify hand-drawn sketch images. Though CNNs are known to be very effective at classifying realistic images, there are few studies on CNNs dealing with non-photorealistic images, and fewer still on images in which the two types are mixed. Classifying non-photorealistic images is difficult mainly because there are no large datasets. In t...


Large Scale Distributed Semi-Supervised Learning Using Streaming Approximation

Traditional graph-based semi-supervised learning (SSL) approaches, even though widely applied, are not suited for massive data and large label scenarios since they scale linearly with the number of edges |E| and distinct labels m. To deal with the large label size problem, recent works propose sketch-based methods to approximate the distribution on labels per node thereby achieving a space redu...


An Overview of Data Privacy in Multi-Agent Learning Systems

Public and private sector entities continuously produce, store, and transact in large amounts of data. However, combined with the growth of the internet, such datasets get stored and accessed on multiple devices, locations, and across the globe. Therefore, the necessity for autonomous agents that can learn across distributed systems to extract knowledge from large datasets while at the same tim...




Publication date: 2016