hadoop

CoHadoop: Flexible Data Placement and Its Exploitation in Hadoop

Journal: PVLDB 2011

Mohamed Y. Eltabakh Yuanyuan Tian Fatma Özcan Rainer Gemulla Aljoscha Krettek John McPherson

Hadoop has become an attractive platform for large-scale data analytics. In this paper, we identify a major performance bottleneck of Hadoop: its lack of ability to colocate related data on the same set of nodes. To overcome this bottleneck, we introduce CoHadoop, a lightweight extension of Hadoop that allows applications to control where data are stored. In contrast to previous approaches, CoH...

The Quantcast File System

2013

Michael Ovsiannikov Silvius Rus Damian Reeves Paul Sutter Sriram Rao Jim Kelly

The Quantcast File System (QFS) is an efficient alternative to the Hadoop Distributed File System (HDFS). QFS is written in C++, is plugin compatible with Hadoop MapReduce, and offers several efficiency improvements relative to HDFS: 50% disk space savings through erasure coding instead of replication, a resulting doubling of write throughput, a faster name node, support for faster sorting and ...

Cluster Computing Paradigms – A Comparative Study of Evolving Frameworks

2016

N. Anila Sundar Vijay Krishna Menon P. N. Kumar

Cluster computing is an approach for storing and processing the huge amounts of data being generated. Hadoop and Spark are the two prominent cluster computing platforms today. Hadoop incorporates the MapReduce concept and is scalable as well as fault-tolerant. But the limitations of Hadoop paved the way for another cluster computing framework named Spark. It is faster and can also mana...

A Relative Study on Task Schedulers in Hadoop MapReduce

2013

Jisha S Manjaly

Hadoop is a framework for big data processing in distributed applications. A Hadoop cluster is built for running data-intensive distributed applications. The Hadoop Distributed File System is the primary storage area for big data. MapReduce is a model to aggregate tasks of a job. Tasks are assigned by schedulers. Schedulers guarantee the fair allocation of resources among users. When a user su...

Analysis of Information Management and Scheduling Technology in Hadoop

Journal: JDIM 2014

Weihua Ma Hong Zhang Qianmu Li Bin Xia

The development of big data computing has brought many changes to society, and social life is constantly being digitized. How to handle vast amounts of data has become an increasingly popular topic. Hadoop is a distributed computing software framework that includes HDFS and the MapReduce distributed computing method, making distributed processing of huge amounts of data possible. Then job scheduler determi...

Tuning Hadoop Map Slot Value Using CPU Metric

2014

Kamal Kc Vincent W. Freeh

Hadoop is a widely used open-source MapReduce framework. Its performance is critical because it increases the usefulness of products and services for the large number of companies that have adopted Hadoop for their business purposes. One of the configuration parameters that influences resource allocation, and thus the performance of a Hadoop application, is the map slot value (MSV). MSV determines t...

Hone: "Scaling Down" Hadoop on Shared-Memory Systems

Journal: PVLDB 2013

K. Ashwin Kumar Jonathan Gluck Amol Deshpande Jimmy J. Lin

The underlying assumption behind Hadoop and, more generally, the need for distributed processing is that the data to be analyzed cannot be held in memory on a single machine. Today, this assumption needs to be re-evaluated. Although petabyte-scale datastores are increasingly common, it is unclear whether “typical” analytics tasks require more than a single high-end server. Additionally, we are ...

Cloud Hadoop Map Reduce For Remote Sensing Image Analysis

2012

Mohamed H. Almeer

Image processing algorithms related to remote sensing have been tested and utilized on the Hadoop MapReduce parallel platform by using an experimental 112-core high-performance cloud computing system that is situated in the Environmental Studies Center at the University of Qatar. Although there has been considerable research utilizing the Hadoop platform for image processing rather than for its...

PigSPARQL: Übersetzung von SPARQL nach Pig Latin

2011

Alexander Schätzle Martin Przyjaciel-Zablocki Thomas Hornung Georg Lausen

This paper investigates the efficient evaluation of SPARQL queries on large RDF datasets. It uses the Apache Hadoop framework, a well-known open-source implementation of Google's MapReduce that enables massively parallel computations on a distributed system. For the evaluation of SPARQL queries with Hadoop, PigSPARQL, a translation ...

An Efficient Data Indexing Approach on Hadoop Using Java Persistence API

2010

Lai Yang Zhongzhi Shi

Data indexing is common in data mining when working with high-dimensional, large-scale data sets. Hadoop, a cloud computing project using the MapReduce framework in Java, has become of significant interest in distributed data mining. To resolve problems of globalization, random-write and duration in Hadoop, a data indexing approach on Hadoop using the Java Persistence API (JPA) is elaborated in...
