Search results for: hadoop

Number of results: 2553

2014
Y. K. Patil V. S. Nandedkar

Document clustering is one of the important areas in data mining. Hadoop is used by companies such as Yahoo, Google, Facebook, and Twitter to implement real-time applications. Email, social media blogs, movie review comments, and books are used for document clustering. This paper focuses on document clustering using Hadoop. Hadoop is a new technology used for parallel computing ...
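For orientation, the sketch below shows the kind of Hadoop MapReduce job a document-clustering pipeline typically begins with: counting term occurrences across a corpus. It is a generic Java illustration, not the paper's implementation; the class names and paths are made up.

```java
// Hypothetical sketch: a term-frequency job of the kind a Hadoop-based
// document-clustering pipeline often starts with. Not the cited paper's code.
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class TermFrequency {

  // Emits (term, 1) for every token in every input document.
  public static class TokenMapper extends Mapper<Object, Text, Text, IntWritable> {
    private static final IntWritable ONE = new IntWritable(1);
    private final Text term = new Text();

    @Override
    protected void map(Object key, Text value, Context context)
        throws IOException, InterruptedException {
      StringTokenizer tokens = new StringTokenizer(value.toString().toLowerCase());
      while (tokens.hasMoreTokens()) {
        term.set(tokens.nextToken());
        context.write(term, ONE);
      }
    }
  }

  // Sums the counts per term; the resulting term vectors feed the clustering step.
  public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context)
        throws IOException, InterruptedException {
      int sum = 0;
      for (IntWritable v : values) {
        sum += v.get();
      }
      context.write(key, new IntWritable(sum));
    }
  }

  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "term frequency");
    job.setJarByClass(TermFrequency.class);
    job.setMapperClass(TokenMapper.class);
    job.setCombinerClass(SumReducer.class);
    job.setReducerClass(SumReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    FileInputFormat.addInputPath(job, new Path(args[0]));   // input documents
    FileOutputFormat.setOutputPath(job, new Path(args[1])); // term counts
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

A real clustering pipeline would usually follow such a job with TF-IDF weighting and a distributed clustering step (for example k-means), but the map/reduce structure stays the same.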

2012
Harcharan Jit Singh V. P. Singh

In data-intensive computing, Hadoop is widely used by organizations. Hadoop's client applications require high availability and scalability of the system. Most of these applications are online, and their data growth rate is unpredictable. The present Hadoop relies on the secondary NameNode for failover, which slows down the performance of the system. The Hadoop system's scalability depends on the ve...

2012
Mohamed H. Almeer

Image processing algorithms related to remote sensing have been tested and utilized on the Hadoop MapReduce parallel platform, using an experimental 112-core high-performance cloud computing system situated in the Environmental Studies Center at the University of Qatar. Although there has been considerable research utilizing the Hadoop platform for image processing rather than for its ...

2014
K. H Naz Anisha Rodrigues

The increasing use of computing resources in our daily lives leads to data generation at an astonishing rate. The computing industry is repeatedly questioned about its ability to accommodate this unpredictable growth rate of data, which has encouraged new developments. Hadoop, which consists of Hadoop MapReduce and the Hadoop Distributed File System (HDFS), is a platform for large-scale data storage and processing. ...
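To make the HDFS side mentioned here concrete, the minimal Java sketch below writes and reads a file through Hadoop's FileSystem API. The NameNode URI and paths are placeholders, and the example is illustrative only, not code from the cited work.

```java
// Hypothetical sketch of basic HDFS access via Hadoop's FileSystem API.
// The cluster URI and paths are placeholders, not taken from the cited paper.
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URI;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    // Placeholder NameNode address; a real deployment reads this from core-site.xml.
    FileSystem fs = FileSystem.get(URI.create("hdfs://namenode:8020"), conf);

    Path file = new Path("/user/demo/hello.txt");

    // Write a small file into HDFS.
    try (FSDataOutputStream out = fs.create(file, true)) {
      out.write("hello from HDFS\n".getBytes(StandardCharsets.UTF_8));
    }

    // Read it back.
    try (BufferedReader in = new BufferedReader(
        new InputStreamReader(fs.open(file), StandardCharsets.UTF_8))) {
      System.out.println(in.readLine());
    }

    fs.close();
  }
}
```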

2012
Diana Maria Moise Frédéric DESPREZ Gabriel ANTONIU

Abstractions were developed based on MapReduce, with the goal of providing a simple-to-use interface for expressing database-like queries [64, 6]. Bioinformatics is one of the numerous research domains that employ MapReduce to model their algorithms [69, 58, 56]. As an example, CloudBurst [69] is a MapReduce-based algorithm for mapping next-generation sequence data to the human genome and other referenc...

Journal: Electronic Proceedings in Theoretical Computer Science, 2021

The Hadoop scheduler is a centerpiece of Hadoop, the leading processing framework for data-intensive applications in the cloud. Given the impact of failures on the performance of running applications, testing and verifying the scheduler is critical. Existing approaches such as simulation and analytical modeling are inadequate because they are not able to ascertain a complete verification of the scheduler. This is due to the wide range of constraints and aspects involved in Hadoo...

2015
Hamza Zafar Farrukh Aftab Khan Bryan Carpenter Aamir Shafi Asad Waqar Malik

Many organizations, including academic, research, and commercial institutions, have invested heavily in setting up High Performance Computing (HPC) facilities for running computational science applications. On the other hand, the Apache Hadoop software, after emerging in 2005, has become a popular, reliable, and scalable open-source framework for processing large-scale data (Big Data). Realizing the i...

2011

The overall goal of this project is to gain hands-on experience working on a large, open-ended, research-oriented project using the Hadoop framework. Hadoop is an open-source implementation of MapReduce and the Google File System, and it is currently enjoying wide popularity. Students will modify the task scheduler of Hadoop, conduct several experimental studies, and analyze performance and netwo...

2015

Enter Hadoop, the de facto open-source standard that is increasingly being used by many companies in large data migration projects. Hadoop is an open-source framework that allows for the distributed processing of large data sets. It is designed to scale up from single servers to thousands of machines, each offering local computation and storage. As data from different sources flows into Hadoop,...

2010
Min Luo Haruo Yokota

Hadoop has been widely used in various clusters to build scalable and high-performance distributed file systems. However, the Hadoop Distributed File System (HDFS) is designed for managing large files. In the case of small-file applications, metadata requests flood the network and consume most of the memory in the NameNode, which sharply hinders performance. Therefore, many web applications ...
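The small-files problem described here is commonly mitigated by packing many small files into a single container file, so the NameNode tracks metadata for one large file instead of thousands of tiny ones. The sketch below does this with a Hadoop SequenceFile; it is a generic illustration of that mitigation, not the approach proposed in this paper, and the paths are hypothetical.

```java
// Hypothetical sketch: packing many small local files into one HDFS SequenceFile
// (filename -> contents), a common mitigation for the NameNode small-files problem.
// Not the approach of the cited paper; all paths are placeholders.
import java.io.File;
import java.nio.file.Files;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PackSmallFiles {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Path packed = new Path("hdfs://namenode:8020/user/demo/small-files.seq");

    File[] smallFiles = new File("/tmp/small-files").listFiles();
    if (smallFiles == null) {
      return; // nothing to pack
    }

    try (SequenceFile.Writer writer = SequenceFile.createWriter(conf,
        SequenceFile.Writer.file(packed),
        SequenceFile.Writer.keyClass(Text.class),
        SequenceFile.Writer.valueClass(BytesWritable.class))) {

      // Each small file becomes one (name, bytes) record instead of one HDFS file,
      // so the NameNode stores metadata for a single large file only.
      for (File f : smallFiles) {
        byte[] bytes = Files.readAllBytes(f.toPath());
        writer.append(new Text(f.getName()), new BytesWritable(bytes));
      }
    }
  }
}
```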
