Apache Spark

A comparison on scalability for batch big data processing on Apache Spark and Apache Flink

2016

Diego García-Gil, Sergio Ramírez-Gallego, Salvador García, Francisco Herrera

Correspondence: Department of Computer Science and Artificial Intelligence, CITIC-UGR (Research Center on Information and Communications Technology), University of Granada, Calle Periodista Daniel Saucedo Aranda, 18071 Granada, Spain.

The large amounts of data have created a need for new fram...


Experimenting sensitivity-based anonymization framework in apache spark

Journal: Journal of Big Data 2018


Thrill: High-Performance Algorithmic Distributed Batch Data Processing with C++

2016

Timo Bingmann, Michael Axtmann, Emanuel Jöbstl, Sebastian Lamm, Huyen Chau Nguyen, Alexander Noe, Sebastian Schlag, Matthias Stumpp, Tobias Sturm, Peter Sanders

We present the design and a first performance evaluation of Thrill – a prototype of a general purpose big data processing framework with a convenient data-flow style programming interface. Thrill is somewhat similar to Apache Spark and Apache Flink with at least two main differences. First, Thrill is based on C++ which enables performance advantages due to direct native code compilation, a more...


Trash Day: Coordinating Garbage Collection in Distributed Systems

2015

Martin Maas, Timothy L. Harris, Krste Asanovic, John Kubiatowicz

Cloud systems such as Hadoop, Spark and Zookeeper are frequently written in Java or other garbage-collected languages. However, GC-induced pauses can have a significant impact on these workloads. Specifically, GC pauses can reduce throughput for batch workloads, and cause high tail-latencies for interactive applications. In this paper, we show that distributed applications suffer from each node...


How Data Volume Affects Spark Based Data Analytics on a Scale-up Server

2015

Ahsan Javed Awan, Mats Brorsson, Vladimir Vlassov, Eduard Ayguadé

The sheer increase in the volume of data over the last decade has triggered research in cluster computing frameworks that enable web enterprises to extract big insights from big data. While Apache Spark is gaining popularity for exhibiting superior scale-out performance on commodity machines, the impact of data volume on the performance of Spark based data analytics in scale-up configuration is not...


Titian: Data Provenance Support in Spark

2015

Matteo Interlandi, Kshitij Shah, Sai Deep Tetali, Muhammad Ali Gulzar, Seunghyun Yoo, Miryung Kim, Todd D. Millstein, Tyson Condie

Debugging data processing logic in Data-Intensive Scalable Computing (DISC) systems is a difficult and time consuming effort. Today's DISC systems offer very little tooling for debugging programs, and as a result programmers spend countless hours collecting evidence (e.g., from log files) and performing trial and error debugging. To aid this effort, we built Titian, a library that enables data ...


Need and Role of Scala Implementations in Bioinformatics

2017

Abbas Rehman, Ali Abbas, Muhammad Atif Sarwar, Javed Ferzund

Next Generation Sequencing has resulted in the generation of large amounts of omics data at a speed that was not possible before. This data is only useful if it can be stored and analyzed at the same speed. Big Data platforms and tools like Apache Hadoop and Spark have solved this problem. However, most of the algorithms used in bioinformatics for Pairwise alignment, Multiple Alignment and...


Characterizing the Performance of Analytics Workloads on the Cray XC40

2016

Michael F. Ringenburg, Shuxia Zhang, Kristyn J. Maschhoff, Bill Sparks, Evan Racah

This paper describes an investigation of the performance characteristics of high performance data analytics (HPDA) workloads on the Cray XC40™, with a focus on commonly-used open source analytics frameworks like Apache Spark. We look at two types of Spark workloads: the Spark benchmarks from the Intel HiBench 4.0 suite and a CX matrix decomposition algorithm. We study performance from both the...


A Story of Suo Motos, Judicial Activism, and Article 184 (3)

Journal: CoRR 2015

Zubair Nabi

The synergy between Big Data and Open Data has the potential to revolutionize information access in the developing world. Following this mantra, we present the analysis of more than a decade's worth of open judgements and orders from the Supreme Court of Pakistan. Our overarching goal is to discern the presence of judicial activism in the country in the wake of the Lawyers’ Movement. Using Apache...


Knowledge Tier Platform for Graph Mining in (Smart) Cities

2016

Miguel Nuñez-del-Prado, Edgardo Bravo, Miguel Sierra, Isaias Hoyos, Miguel Canchay

In the present effort, we present a knowledge tier platform to collect information from cities in the form of graphs. This platform enables people to share information about the area where they live, allowing them to report on pollution, crime levels, traffic jams, street topology, businesses, markets, etc. The main objective is to provide information, stored in Elastic about a city to find sp...
