hadoop

The Anatomy of MapReduce Jobs, Scheduling, and Performance Challenges

2013

Shouvik Bardhan Daniel A. Menascé

Hadoop is a leading open source tool that supports the realization of the Big Data revolution and is based on Google’s pioneering MapReduce work on storing and processing ultra-large amounts of data. Instead of relying on expensive proprietary hardware, Hadoop clusters typically consist of hundreds or thousands of multi-core commodity machines. Instead of moving data to the processi...

Assessing the Performance Impact of High-Speed Interconnects on MapReduce

2012

Yandong Wang Yizheng Jiao Cong Xu Xiaobing Li Teng Wang Xinyu Que Cristian Cira Bin Wang Zhuo Liu Bliss Bailey Weikuan Yu

Hadoop is a successful open-source implementation of the MapReduce programming model. It has been widely adopted by many leading industry companies for big data analytics. However, its intermediate data shuffling is a time-consuming operation that impacts the total execution time of MapReduce programs. Recently, a growing number of organizations are interested in addressing this issue by leveraging ...

Assigning Tasks for Efficiency in Hadoop

2010

Michael J. Fischer Xueyuan Su Yitong Yin

In recent years, Google’s MapReduce has emerged as a leading large-scale data processing architecture. Adopted for daily use by companies such as Amazon, Facebook, Google, IBM and Yahoo!, and more recently put to use by several universities, it allows parallel processing of huge volumes of data over clusters of machines. Hadoop is a free Java implementation of MapReduce. In Hadoop, files are split...

Beyond Batch Processing: Towards Real-Time and Streaming Big Data

Journal: Computers, 2014

Saeed Shahrivari

Today, big data is generated from many sources and there is a huge demand for storing, managing, processing, and querying big data. The MapReduce model and its open source implementation, Hadoop, have proven themselves the de facto solution to big data processing. Hadoop is inherently designed for batch and high-throughput processing jobs. Although Hadoop is very suitable for batch ...

Optimization Framework for Map Reduce Clusters on Hadoop’s Configuration

2016

Trupti Mali Deepti Varshney

Hadoop is a Java-based distributed computing framework designed to support applications implemented via the MapReduce programming model. Hadoop performance, however, is significantly affected by the settings of the Hadoop configuration parameters. Unfortunately, manually tuning these parameters is very time-consuming. Existing system uses Random forest approa...

The Recovery System for Hadoop Cluster

2014

Priya Deshpande Darshan Bora

Due to the brisk growth of data volume in many organizations, large-scale data processing has become a demanding topic for industry as well as academia. Hadoop is widely adopted in cloud computing environments for unstructured data. Hadoop is an open source, Java-based distributed computing framework that supports large-scale distributed data processing. In recent years, Hadoop Distribu...

Parallelizing K-means with Hadoop/Mahout for Big Data Analytics

2015

Jianbin Cui Hongying Meng

The rapid development of the Internet and cloud computing technologies has led to explosive generation and processing of huge amounts of data. The ever-increasing data volumes bring great value to society, but at the same time pose a number of challenges. Data mining techniques have been widely used in decision analysis in financial, medical, management, business and many other fields. H...

Analysis and Optimization of the Hadoop Speculative Execution Mechanism

2016

Xianjin Luo Chenggang Zhen

Existing Hadoop clusters are mostly composed of heterogeneous nodes with different computing and storage capacities, so the speed at which map and reduce tasks run varies considerably from node to node. However, the finish time of the entire job is determined by the slowest task, so the strategy for finding these “drag tasks” plays a dominant role in the whole job scheduling process. The ...

MapReduce Frameworks: Comparing Hadoop and HPCC

2016

Fabian Fier Eva Höfer Johann-Christoph Freytag

MapReduce and Hadoop are often used synonymously. For optimal runtime performance, Hadoop users have to consider various implementation details and configuration parameters. When conducting performance experiments with Hadoop on different algorithms, it is hard to choose a set of such implementation optimizations and configuration options which is fair to all algorithms. By fair we mean default...

Object-Tagged RBAC Model for the Hadoop Ecosystem

2017

Maanak Gupta Farhan Patwa Ravi S. Sandhu

The Hadoop ecosystem provides a highly scalable, fault-tolerant and cost-effective platform for storing and analyzing a variety of data formats. Apache Ranger and Apache Sentry are two predominant frameworks used to provide authorization capabilities in the Hadoop ecosystem. In this paper we present a formal multi-layer access control model (called HeAC) for the Hadoop ecosystem, as an academic-style abstrac...
