hadoop

A Benchmarking Case Study of Virtualized Hadoop Performance on vSphere 5 - White Paper: VMware, Inc

2011

The performance of three Hadoop applications is reported for several virtual configurations on VMware vSphere 5 and compared to native configurations. A well-balanced seven-node AMAX ClusterMax system was used to show that the average performance difference between native and the simplest virtualized configurations is only 4%. Further, the flexibility enabled by virtualization to create multipl...


Distributed High-Dimensional Index Creation using Hadoop, HDFS and C++

2012

Gylfi Þór Gudmundsson, Laurent Amsaleg, Björn Þór Jónsson

This paper describes an initial study where the open-source Hadoop parallel and distributed run-time environment is used to speed up the construction phase of a large high-dimensional index. This paper first discusses the typical practical problems developers may run into when porting their code to Hadoop. It then presents early experimental results showing that the performance gains are substan...


Cross-Layer Scheduling in Cloud Computing Systems

2014

Indranil Gupta

Today, application schedulers are decoupled from routing-level schedulers, leading to sub-optimal throughput for cloud computing platforms. In this thesis, we propose a cross-layer scheduling framework that bridges the application-level scheduler with the routing-level scheduler (SDN). We realize our framework in a batch-processing framework (Hadoop [1]) and a stream-processing framework (Storm ...



A Task Scheduling Algorithm for Hadoop Platform

Journal: JCP 2013

Dan Wang, Jilan Chen, Wenbing Zhao

MapReduce is a software framework for easily writing applications that process vast amounts of data on large clusters of commodity hardware. To achieve better task allocation and load balancing, this paper analyzes the MapReduce work mode and the task scheduling algorithm of the Hadoop platform. According to this situation that the number of tasks of the smaller weight job is mo...


GRADOOP: Scalable Graph Data Management and Analytics with Hadoop

Journal: CoRR 2015

Martin Junghanns, André Petermann, Kevin Gómez, Erhard Rahm

Many Big Data applications in business and science require the management and analysis of huge amounts of graph data. Previous approaches for graph analytics such as graph databases and parallel graph processing systems (e.g., Pregel) either lack sufficient scalability or flexibility and expressiveness. We are therefore developing a new end-to-end approach for graph data management and analysi...


PigSPARQL: A SPARQL Query Processing Baseline for Big Data

2013

Alexander Schätzle, Martin Przyjaciel-Zablocki, Thomas Hornung, Georg Lausen

In this paper we discuss PigSPARQL, a competitive yet easy-to-use SPARQL query processing system on MapReduce that allows ad hoc SPARQL query processing on large RDF graphs out of the box. Instead of a direct mapping, PigSPARQL uses the query language of Pig, a data analysis platform on top of Hadoop MapReduce, as an intermediate layer between SPARQL and MapReduce. This additional level of abstr...


Analysis and Modeling of MapReduce’s Performance on Hadoop YARN

2015

Qiuyi Tang, Thomas C. Bressoud

With the rapid growth of technology, scientists have recognized the challenge of efficiently analyzing large data sets since the beginning of the 21st century. Increases in data volume and data complexity have shifted scientists’ focus to parallel, distributed algorithms running on clusters. In 2004, Jeffrey Dean and Sanjay Ghemawat from Google introduced a new programming model to store and process large dat...


Pig vs Hive: Benchmarking High Level Query Languages

2014

Benjamin Jakobus, Peter McBrien

This article presents results from two benchmarking sets (run on small clusters of 6 and 9 nodes) applied to Hive and Pig running on Hadoop 0.14.1. The first set of results was obtained by replicating the Apache Pig benchmark published by the Apache Foundation on 11/07/07 (which served as a baseline to compare major Pig Latin releases). The second results were obtained by applying ...


Towards Efficient Design and Implementation of a Hadoop-based Distributed Video Transcoding System in Cloud Computing Environment

2013

Myoungjin Kim, Yun Cui, Seungho Han, Hanku Lee

In this paper, we propose a Hadoop-based Distributed Video Transcoding System in a cloud computing environment that transcodes various video codec formats into the MPEG-4 video format. This system provides various types of video content to heterogeneous devices such as smartphones, personal computers, televisions, and pads. We design and implement the system using the MapReduce framework, which...
