VSFS: A Versatile Searchable File System for HPC Analytics
Authors
Abstract
Emerging HPC analytics applications urgently demand file-search services to drastically reduce the scale of the input data in real time, so that the speed of computation and data analytics can be greatly accelerated. Unfortunately, the existing file-search solutions either scale poorly on large systems or lack a well-integrated interface that lets applications easily use them for critical tasks. We believe that the time is ripe for the design of a searchable file system capable of accurate and scalable system-level file-search functionality. In this paper, we propose a Versatile Searchable File System, VSFS, which provides a transparent, accurate and real-time file-search service through a POSIX-compatible file system namespace that can be integrated into any HPC/Big Data legacy code without modifications. Additionally, to support real-time file search, VSFS uses a DRAM-based distributed architecture to perform real-time file indexing. Moreover, a versatile index scheme is designed to adapt to the various forms of HPC datasets. The results of our VSFS prototype evaluation show that VSFS is scalable in a typical HPC environment. It achieves significantly better file-indexing and file-search performance than the popular SQL/NoSQL solutions, while introducing only negligible I/O overhead. Finally, we integrate VSFS into a scientific analytics application to show its benefits in terms of performance and ease of use.
Similar resources
Providing Flexible File-Level Data Filtering for Big Data Analytics
The enormous volume of big data datasets imposes the need for effective data-filtering techniques to accelerate the analytics process. We propose a Versatile Searchable File System, VSFS, which provides a transparent, flexible and near real-time file-level data filtering service by searching files directly through the file system. Therefore, big data analytics applications can transparently util...
Propeller: A Scalable Metadata Organization for A Versatile Searchable File System
The exponentially increasing amount of data in file systems has made it increasingly important for file systems to provide fast file-search services. The quality of the file-search services is significantly affected by the file-index overhead, the file-search responsiveness and the accuracy of search results. Unfortunately, the existing file-search solutions either are so poorly scalable that t...
Microsoft Word - EvaluationOfJava_ieeeformat_2.docx
Abstract—In the last few years, Java has gained popularity in processing "big data", mostly with the Apache big data stack, a collection of open source frameworks dealing with abundant data, which includes several popular systems such as Hadoop, Hadoop Distributed File System (HDFS), and Spark. Efforts have been made to introduce Java to High Performance Computing (HPC) as well in the past, but were no...
HPC and Big Data Convergence for Extreme Heterogeneous Systems
As the data deluge grows ever greater, large-scale data analytics workloads are quickly becoming critical computational tools within the scientific community. Recently, convergence efforts have focused on combining aspects of HPC and "big data" analytics workloads together using a unified supercomputing system. This has the opportunity to bring advanced analytical tools to scientists which enable ...
A Fuzzy TOPSIS Approach for Big Data Analytics Platform Selection
Big data sizes are constantly increasing. Big data analytics is where advanced analytic techniques are applied on big data sets. Analytics based on large data samples reveals and leverages business change. The popularity of big data analytics platforms, which are often available as open-source, has not remained unnoticed by big companies. Google uses MapReduce for PageRank and inverted indexes....
Journal title:
Volume, issue:
Pages: -
Publication date: 2013