Scalable Analytics Model Calibration with Online Aggregation
Abstract
Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination, even by experienced data scientists. We argue that the lack of support for quickly identifying sub-optimal configurations is the principal cause. In this paper, we apply parallel online aggregation to identify sub-optimal configurations early in the processing by incrementally sampling the training dataset and estimating the objective function corresponding to each configuration. We design concurrent online aggregation estimators and define halting conditions to stop the execution accurately and in a timely manner. The end result is online approximate gradient descent, a novel optimization method for scalable model calibration. We show how online approximate gradient descent can be represented as generic database aggregation and implement the resulting solution in GLADE, a state-of-the-art Big Data analytics system.
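The idea of estimating each configuration's objective from a growing sample and halting sub-optimal ones early can be illustrated with a minimal sketch. This is not the paper's estimators or the GLADE implementation. The function names (`squared_loss`, `prune_configs`), the linear model with squared loss, the chunk size, and the halting rule (a normal-approximation confidence interval with a finite-population correction) are all assumptions chosen for illustration. The sketch covers only objective estimation and pruning, not the gradient descent iterations themselves.

```python
import math
import random


def squared_loss(w, x, y):
    """Squared error of a linear model with weights w on one example."""
    pred = sum(wi * xi for wi, xi in zip(w, x))
    return (pred - y) ** 2


def prune_configs(data, configs, chunk=50, z=1.96, seed=0):
    """Estimate the mean loss of each configuration from a growing sample.

    data    : list of (x, y) pairs, where x is a list of floats
    configs : list of candidate weight vectors (hypothetical configurations)
    chunk   : examples added to the sample per round (assumed value)
    z       : normal quantile for the confidence bounds (assumed, about 95%)
    Returns (set of surviving configuration indices, examples sampled).
    """
    total = len(data)
    order = list(range(total))
    random.Random(seed).shuffle(order)   # random order = sampling without replacement
    sums = [0.0] * len(configs)          # running sum of losses per configuration
    sqs = [0.0] * len(configs)           # running sum of squared losses
    alive = set(range(len(configs)))
    seen = 0

    for start in range(0, total, chunk):
        batch = [data[i] for i in order[start:start + chunk]]
        seen += len(batch)
        # Only configurations that are still alive keep consuming samples.
        for c in alive:
            for x, y in batch:
                loss = squared_loss(configs[c], x, y)
                sums[c] += loss
                sqs[c] += loss * loss
        if seen < 2:
            continue                      # a variance estimate needs two samples

        # Finite-population correction: the bounds shrink to zero width
        # once the whole dataset has been sampled.
        fpc = (total - seen) / max(total - 1, 1)
        bounds = {}
        for c in alive:
            mean = sums[c] / seen
            var = max(sqs[c] / seen - mean * mean, 0.0) * seen / (seen - 1)
            half = z * math.sqrt(var / seen * fpc)
            bounds[c] = (mean - half, mean + half)

        # Halting condition (assumed): drop a configuration when its lower
        # bound exceeds the smallest upper bound among the survivors.
        best_upper = min(hi for _, hi in bounds.values())
        alive = {c for c in alive if bounds[c][0] <= best_upper}
        if len(alive) == 1:
            break
    return alive, seen
```

With clearly separated configurations, the loop typically stops after a small fraction of the data. Configurations whose intervals never separate are carried to the end, where the estimates become exact.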
Similar resources
Speculative Approximations for Terascale Analytics
Model calibration is a major challenge faced by the plethora of statistical analytics packages that are increasingly used in Big Data applications. Identifying the optimal model parameters is a time-consuming process that has to be executed from scratch for every dataset/model combination even by experienced data scientists. We argue that the incapacity to evaluate multiple parameter configurat...
GLADE-ML: A Database For Big Data Analytics
Big Data Analytics has been a hot topic in computing systems, and various systems have emerged to better support it. Though databases have been the data hub for decades, they fall short of Big Data Analytics due to inherent limitations. This dissertation presents GLADE-ML, a scalable and efficient parallel database that is specifically tailored for Big Data Analytics. Different from...
Towards Security in Distributed Home System
Today, personal data analytics and privacy face a dichotomy: application authors and service providers require scalable analytics systems, while users and regulators increasingly demand applications that respect individuals’ privacy. In this paper we propose to use a VPN to solve new security challenges in a distributed home network system. On a prototype implementation, our initial ...
Spatial Online Sampling and Aggregation
The massive adoption of smartphones and other mobile devices has generated a huge amount of spatial and spatio-temporal data. The importance of spatial analytics and aggregation is ever-increasing. An important challenge is to support interactive exploration over such data. However, spatial analytics and aggregation using all data points that satisfy a query condition is expensive, especiall...
Scalable Social Analytics for Online Communities
With the constantly growing ecosphere of online communities, their managers, operators and members can benefit hugely from a rich set of tools to successfully understand, control, exploit and utilise them. This requires extracting reusable, interpretable analytics in real time from the streams of dynamically, socially produced data. In this article, we summarise our efforts in the context of th...
Journal title:
- IEEE Data Eng. Bull.
Volume 38
Publication date: 2015