hadoop

Impala: A Modern, Open-Source SQL Engine for Hadoop

2015

Marcel Kornacker Alexander Behm Victor Bittorf Taras Bobrovytsky Casey Ching Alan Choi Justin Erickson Martin Grund Daniel Hecht Matthew Jacobs Ishaan Joshi Lenni Kuff Dileep Kumar Alex Leblang Nong Li Ippokratis Pandis Henry Robinson David Rorke Silvius Rus John Russell Dimitris Tsirogiannis Skye Wanderman-Milne Michael Yoder

Cloudera Impala is a modern, open-source MPP SQL engine architected from the ground up for the Hadoop data processing environment. Impala provides low latency and high concurrency for BI/analytic read-mostly queries on Hadoop, not delivered by batch frameworks such as Apache Hive. This paper presents Impala from a user’s perspective, gives an overview of its architecture and main components and...

A comprehensive view of Hadoop research - A systematic literature review

Journal: J. Network and Computer Applications 2014

Ivanilton Polato Reginaldo Ré Alfredo Goldman Fabio Kon

Context: In recent years, the valuable knowledge that can be retrieved from petabyte scale datasets – known as Big Data – led to the development of solutions to process information based on parallel and distributed computing. Lately, Apache Hadoop has attracted strong attention due to its applicability to Big Data processing. Problem: The support of Hadoop by the research community has provided...

FP-Hadoop: Efficient Execution of Parallel Jobs Over Skewed Data

Journal: PVLDB 2015

Miguel Liroz-Gistau Reza Akbarinia Patrick Valduriez

Big data parallel frameworks, such as MapReduce or Spark, have been praised for their high scalability and performance, but show poor performance in the case of data skew. There are important cases where a high percentage of processing on the reduce side ends up being done by only one node. In this demonstration, we illustrate the use of FP-Hadoop, a system that efficiently deals with data skew ...

Improving the Performance of Processing for Small Files in Hadoop: A Case Study of Weather Data Analytics

2014

Guru Prasad

Hadoop is an open-source Apache project that supports a master-slave architecture, which involves one master node and thousands of slave nodes. The master node acts as the name node, which stores all the metadata of files, and the slave nodes act as the data nodes, which store all the application data. Hadoop is designed to process large data sets (petabytes). It becomes a bottleneck when handling mas...

Hadoop performance modeling and job optimization for big data analytics

2015

Mukhtaj Khan

Big data has gained momentum in both academia and industry. The MapReduce model has emerged as a major computing model in support of big data analytics. Hadoop, which is an open source implementation of the MapReduce model, has been widely taken up by the community. Cloud service providers such as Amazon EC2 now support Hadoop user applications. However, a key challenge is ...

Sentiment Analysis on Hadoop with Hadoop Streaming

2015

Piyush Gupta Pardeep Kumar Girdhar Gopal Sungyoung Lee Xiaoqian Zhang Shoushan Li Guodong Zhou Hongxia Zhao Fan Miao Zhou Xiaoxia Hui Song Yingxiang Fan Xiaoqiang Liu Dao Tao Bin Wen Wenhua Dai Junzhe Zhao Zhongqing Wang

People's ideas and opinions are influenced by the opinions of others. Much research is under way on the analysis of reviews written by people. Sentiment analysis is the major computational technique to calculate or observe sentiments in people's thoughts. Therefore, a method that assigns scores indicating positive and negative opinion about the product is proposed. It uses Hadoop Distrib...

An adaptive scheduling algorithm for dynamic heterogeneous Hadoop systems

2011

Aysan Rasooli Oskooei Douglas G. Down

The MapReduce and Hadoop frameworks were designed to support efficient large scale computations. There has been growing interest in employing Hadoop clusters for various diverse applications. A large number of (heterogeneous) clients, using the same Hadoop cluster, can result in tensions between the various performance metrics by which such systems are measured. On the one hand, from the servic...

Improved Fair Scheduling Algorithm for Hadoop Clustering

2017

Sneha and Shoney Sebastian

The traditional way of storing such a huge amount of data is inconvenient because processing that data in later stages is a very tedious job. So nowadays, Hadoop is used to store and process large amounts of data. Statistics on the data generated in recent years show that the volume has been very high in the last 2 years. Hadoop is a good framework to store and process data efficiently. It works li...

New Framework For Improving Big Data Analysis Using Mobile Agent

2014

Youssef M. ESSA Gamal ATTIYA Ayman EL-SAYED

The rising number of applications serving millions of users and dealing with terabytes of data calls for faster processing paradigms. Recently, there has been growing enthusiasm for the notion of big data analysis. Big data analysis has become a very important aspect for growth in productivity, reliability and quality of services (QoS). Processing of big data using a powerful machine is not efficient solut...

Implementation of image processing system using handover technique with map reduce based on big data in the cloud environment

Journal: Int. Arab J. Inf. Technol. 2016

Mehraj Ali John Kumar

Cloud computing is one of the emerging techniques for processing big data. Cloud computing is also known as service on demand. A large set or large volume of data is known as big data. Processing big data (MRI images and DICOM images) normally takes more time. Hard tasks such as handling big data can be solved by using the concepts of Hadoop. Enhancing the Hadoop concept will help the user t...