hadoop

GreenHDFS: Towards An Energy-Conserving, Storage-Efficient, Hybrid Hadoop Compute Cluster

2010

Rini T. Kaushik Milind Bhandarkar

The Hadoop Distributed File System (HDFS) presents unique challenges to existing energy-conservation techniques and makes it hard to scale down servers. We propose an energy-conserving, hybrid, logical multi-zoned variant of HDFS for managing data-processing-intensive, commodity Hadoop clusters. GreenHDFS’s data-classification-driven data placement allows scale-down by guaranteeing substantiall...

Hadoop neural network for parallel and distributed feature selection

Journal: Neural networks: the official journal of the International Neural Network Society, 2016

Victoria J. Hodge Simon O'Keefe Jim Austin

In this paper, we introduce a theoretical basis for a Hadoop-based neural network for parallel and distributed feature selection in Big Data sets. It is underpinned by an associative memory (binary) neural network which is highly amenable to parallel and distributed processing and fits with the Hadoop paradigm. There are many feature selectors described in the literature which all have various ...

Enhancing Data Processing on Clouds with Hadoop/HBase

2011

Chen Zhang

In the current information age, large amounts of data are being generated and accumulated rapidly in various industrial and scientific domains. This imposes important demands on data processing capabilities that can extract sensible and valuable information from the large amount of data in a timely manner. Hadoop, the open source implementation of Google’s data processing framework (MapReduce, ...

Benchmarking and Performance studies of MapReduce / Hadoop Framework on Blue Waters Supercomputer

2015

Manisha Gajbe Kalyana Chadalavada Gregory Bauer William Kramer

MapReduce is an emerging and widely used programming model for large-scale data-parallel applications that need to process large amounts of raw data. There are several implementations of the MapReduce framework, among which Apache Hadoop is the most commonly used open-source implementation. These frameworks are rarely deployed on supercomputers as massive as Blue Waters. We want to evaluate ho...

Adaptive Data Replication Scheme Based on Access Count Prediction in Hadoop

2013

Jungha Lee JongBeom Lim Daeyong Jung KwangSik Chung JoonMin Gil

Hadoop, an open source implementation of the MapReduce framework, has been widely used for processing massive-scale data in parallel. Since Hadoop uses a distributed file system, called HDFS, the data locality problem often happens (i.e., a data block should be copied to the processing node when a processing node does not possess the data block in its local storage), and this problem leads to t...

An Optimal Task Assignment Policy and Performance Diagnosis Strategy for Heterogeneous Hadoop Cluster

2013

Shekhar Gupta

The goal of the proposed research is to improve the performance of Hadoop-based software running on a heterogeneous cluster. My approach lies at the intersection of machine learning, scheduling, and diagnosis. I mainly focus on heterogeneous Hadoop clusters and try to improve performance by implementing a more efficient scheduler for this class of cluster.

Grammar based statistical MT on Hadoop: An end-to-end toolkit for large scale PSCFG based MT

Journal: Prague Bull. Math. Linguistics, 2009

Ashish Venugopal Andreas Zollmann

This paper describes the open-source Syntax Augmented Machine Translation (SAMT) on Hadoop toolkit, an end-to-end grammar based statistical machine translation framework running on the Hadoop implementation of the MapReduce programming model. We present the underlying methodology of the SAMT approach with detailed instructions that describe how to use the toolkit to build grammar based s...

Scheduling in Hadoop: An introduction to the pluggable scheduler framework

2013

M. Tim Jones

Hadoop supports pluggable schedulers that assign resources to jobs. However, as we know from traditional scheduling, not all algorithms are the same, and efficiency is workload- and cluster-dependent. Get to know Hadoop scheduling, and explore two of the algorithms available today: fair scheduling and capacity scheduling. Also, learn how these algorithms are tuned and in what s...

Taming the zoo - about algorithms implementation in the ecosystem of Apache Hadoop

Journal: CoRR, 2013

Piotr Jan Dendek Artur Czeczko Mateusz Fedoryszak Adam Kawa Piotr Wendykier Lukasz Bolikowski

Content Analysis System (CoAnSys) is a research framework for mining scientific publications using Apache Hadoop. This article describes the algorithms currently implemented in CoAnSys including classification, categorization and citation matching of scientific publications. The size of the input data classifies these algorithms in the range of big data problems, which can be efficiently solved...

Redoop: Supporting Recurring Queries in Hadoop

2014

Chuan Lei Elke A. Rundensteiner Mohamed Y. Eltabakh

The growing demand for large-scale data analytics ranging from online advertisement placement, log processing, to fraud detection, has led to the design of highly scalable data-intensive computing infrastructures such as the Hadoop platform. Recurring queries, repeatedly being executed for long periods of time on rapidly evolving high-volume data, have become a bedrock component in most of thes...
