Apache Spark

Towards a distributed, scalable and real-time RDF Stream Processing engine

2016

Xiangnan Ren

Due to the growing need to process data produced in the Semantic Web in a timely manner and to derive valuable information and knowledge from it, RDF stream processing (RSP) has emerged as an important research domain. Modern RSP systems have to address the volume and velocity characteristics encountered in the Big Data era. This comes at the price of designing high throughput, low latency, fault tolerant, hi...

Document Classification Using Distributed Machine Learning

Journal: :CoRR 2015

Galip Aydin Ibrahim Riza Hallac

In this paper, we investigate the performance and success rates of the Naïve Bayes classification algorithm for automatic classification of Turkish news into predetermined categories such as economy, life, and health. We use Apache Big Data technologies such as Hadoop, HDFS, Spark and Mahout, and apply these distributed technologies to machine learning. Keywords: news classification, distributed machin...

Anatomy of machine learning algorithm implementations in MPI, Spark, and Flink

Journal: :IJHPCA 2018

Supun Kamburugamuve Pulasthi Wickramasinghe Saliya Ekanayake Geoffrey C. Fox

With the ever-increasing need to analyze large amounts of data to get useful insights, it is essential to develop complex parallel machine learning algorithms that can scale with data and number of parallel processes. These algorithms need to run on large data sets and to execute in minimal time in order to extract useful information in a time-constrained environment. MPI...

Real-Time Analysis of Students’ Activities on an

2017

Abdelmajid Chaffai Larbi Hassouni Houda Anoun

Real-time analytics is the capacity to extract valuable insights from data that comes continuously from activities on the web or network sensors. It is largely used in web-based business to drive decisions based on users' experiences, such as dynamic pricing and personalized advertising. Many universities have adopted web-based learning in their learning process. They use data-mining techniques t...

SPARK: Personalized Parkinson Disease Interventions through Synergy between a Smartphone and a Smartwatch

2014

Vinod Sharma Kunal Mankodiya Fernando De la Torre Ada Zhang Neal Ryan Thanh G. N. Ton Rajeev Gandhi Samay Jain

Parkinson disease (PD) is a neurodegenerative disorder afflicting more than 1 million aging Americans, incurring $23 billion in annual medical costs in the U.S. alone. Approximately 90% of Parkinson patients undergoing treatment have medication-related mobility problems that prevent them from performing their activities of daily living. Efficient management of PD requires complex medication regi...

Scaling Spark on Lustre

2016

Nicholas Chaimov Allen D. Malony Costin Iancu Khaled Z. Ibrahim

We report our experiences in porting and tuning the Apache Spark data analytics framework on the Cray XC30 (Edison) and XC40 (Cori) systems, installed at NERSC. We find that design decisions made in the development of Spark are based on the assumption that Spark is constrained primarily by network latency, and that disk I/O is comparatively cheap. These assumptions are not valid on Edison or Co...

Dynamic Multi-Objective Optimization with jMetal and Spark: A Case Study

2016

José A. Cordero Antonio J. Nebro Cristóbal Barba-González Juan José Durillo José García-Nieto Ismael Navas Delgado José Francisco Aldana Montes

Technologies for Big Data and Data Science are receiving increasing research interest. This paper introduces the prototype architecture of a tool aimed at solving Big Data Optimization problems. Our tool combines the jMetal framework for multi-objective optimization with Apache Spark, a technology that is gaining momentum. In particular, we make use of the streaming facilities of Spark...

Emma in Action: Deklarative Datenflüsse für Skalierbare Datenanalyse

2017

Alexander Alexandrov Georgi Krastev Bernd Louis Andreas Salzmann Volker Markl

Interfaces for programming parallel dataflows that are based on higher-order functions (such as map and reduce) have become popular over the last ten years through systems such as Apache Hadoop, Apache Flink and Apache Spark. In contrast to SQL, such programming interfaces are realized as embedded domain-specific languages (eDSLs). At the core of every eDSL is a ...

On the Evaluation of RDF Distribution Algorithms Implemented over Apache Spark

2015

Olivier Curé Hubert Naacke Mohamed Amine Baazizi Bernd Amann

Querying very large RDF data sets in an efficient and scalable manner requires parallel query plans combined with appropriate data distribution strategies. Several innovative solutions have recently been proposed for optimizing data distribution with or without predefined query workloads. This paper presents an in-depth analysis and experimental comparison of five representative RDF data distri...

Pipeline for Real-time Anomaly Detection in Log Data Streams using Apache Kafka and Apache Spark

Journal: :International Journal of Computer Applications 2018
