Improving Backup and Restore Performance for Deduplication-based Cloud Backup Services

Author

  • Stephen Mkandawire
Abstract

The benefits provided by cloud computing and the space savings offered by data deduplication make it attractive to host data storage services such as backup in the cloud. Data deduplication relies on comparing fingerprints of data chunks, stored in a chunk index, to identify and remove redundant data, with the ultimate goal of saving storage space and network bandwidth. However, the chunk index presents a bottleneck to the throughput of the backup operation. While several solutions that address deduplication throughput have been proposed, the chunk index remains a centralized resource and limits the scalability of both storage capacity and backup throughput in public cloud environments. To address this challenge, we propose the Scalable Hybrid Hash Cluster (SHHC), which hosts a low-latency distributed hash table for storing fingerprints. SHHC is a cluster of nodes designed to scale and handle numerous concurrent backup requests while maintaining high fingerprint lookup throughput. Each node in the cluster features hybrid memory consisting of DRAM and Solid State Drives (SSDs) to present a large usable memory for storing the chunk index. Our evaluation with real-world workloads shows that SHHC is consistently scalable as the number of nodes increases: throughput grows almost linearly with the number of nodes.

Restore performance over relatively low-bandwidth wide area network (WAN) links is another drawback of cloud backup services. High-speed network connectivity is either too expensive for most organizations or reserved for special applications. Removing redundant data before transmission over the WAN offers a viable option for improving network throughput during the restore operation. To that end, we propose Application-Aware Phased Restore (AAPR), a simple restore solution for deduplication-based cloud backup clients. AAPR improves restore time by removing redundant data before transmitting it over the WAN. Furthermore, we exploit application awareness to restore critical data first and thus improve recovery time. Our evaluations show that, for workloads with high redundancy, AAPR reduces restore time by over 85%.

Acknowledgements

I would like to express my thanks first and foremost to the Almighty God, without whom all this is meaningless. I would like to deeply thank my advisor, Dr. Hong Jiang, for his support and guidance. Thank you for being patient with me and guiding me not only academically but also on other matters of life. In you I found a true mentor, and it is an honor and privilege to be advised …
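To make the fingerprint-lookup path described in the abstract concrete, the short Python sketch below shows how a backup client might hash each chunk, route the fingerprint to one of several chunk-index nodes, and store only chunks whose fingerprints are not already indexed. All names (ChunkIndexNode, dedup_store, NUM_NODES) and the simple modulo routing are illustrative assumptions, not the SHHC implementation from the thesis.

# Illustrative sketch of fingerprint-based deduplication over a partitioned
# chunk index. Names and the modulo routing are assumptions for illustration.
import hashlib

NUM_NODES = 4  # assumed number of hash-cluster nodes

class ChunkIndexNode:
    """One node of the distributed chunk index (DRAM/SSD tiering omitted)."""
    def __init__(self):
        self.index = {}  # fingerprint -> location of the stored chunk

    def lookup_or_insert(self, fingerprint, location):
        """Return True if the fingerprint was new and has been inserted."""
        if fingerprint in self.index:
            return False                      # duplicate chunk: skip storing it
        self.index[fingerprint] = location
        return True

nodes = [ChunkIndexNode() for _ in range(NUM_NODES)]

def dedup_store(chunks, storage):
    """Store only unique chunks, routing each fingerprint to a node by hash."""
    for chunk in chunks:
        fp = hashlib.sha256(chunk).hexdigest()
        node = nodes[int(fp, 16) % NUM_NODES]     # pick the node that owns fp
        if node.lookup_or_insert(fp, len(storage)):
            storage.append(chunk)                 # unique chunk: write it
    return storage

# Example: the repeated chunk b"aaa" is detected and stored only once.
stored = dedup_store([b"aaa", b"bbb", b"aaa"], [])
assert len(stored) == 2

In SHHC the index itself would span DRAM and SSD on each node; the in-memory dictionary above simply stands in for that hybrid store.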


Similar articles

A Novel Way of Deduplication Approach for Cloud Backup Services Using Block Index Caching Technique

Data deduplication describes an approach that reduces the storage capacity needed to store data or the amount of data that has to be transferred over the network. Cloud storage has received increasing attention from industry as it offers infinite storage resources that are available on demand. Source deduplication is useful in cloud backup because it saves network bandwidth and reduces storage space. Deduplication is the pr...


Towards Cloud Security Improvement with Encryption Intensity Selection

The emergence of cloud computing has enabled users to store and manage data at low cost and high availability, which has made outsourcing data appealing to customers. Since the inception of cloud computing, efforts have been dedicated to improving the security and performance of cloud-based backup services. However, no complete scheme has yet been implemented that provides security, ...


Design Tradeoffs for Data Deduplication Performance in Backup Workloads

Data deduplication has become a standard component in modern backup systems. In order to understand the fundamental tradeoffs in each of its design choices (such as prefetching and sampling), we disassemble data deduplication into a large N-dimensional parameter space. Each point in the space represents a particular combination of parameter settings and makes a tradeoff among backup and restore performance, memory ...


Improving restore speed for backup systems that use inline chunk-based deduplication

Slow restoration due to chunk fragmentation is a serious problem facing inline chunk-based data deduplication systems: restore speeds for the most recent backup can drop by orders of magnitude over the lifetime of a system. We study three techniques (increasing cache size, container capping, and using a forward assembly area) for alleviating this problem. Container capping is an ingest-time operati...
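Since this snippet mentions a forward assembly area without room to explain it, the sketch below gives one plausible reading of the general idea: reserve an in-memory buffer for a span of the restored file, read each needed container only once, and copy its chunks directly into their final offsets. The fixed chunk size and the read_container callback are simplifying assumptions, not details from the cited paper.

# Hypothetical sketch of a forward assembly area (FAA) restore.
# Assumes every chunk is exactly CHUNK_SIZE bytes and that read_container()
# returns a dict mapping chunk ids to chunk bytes; both are assumptions.
CHUNK_SIZE = 8 * 1024

def restore_span(recipe, read_container):
    """Assemble one restore span in memory, reading each container only once.

    recipe: list of (container_id, chunk_id, buffer_offset) in file order.
    """
    # Group the needed chunks by container so every container is fetched once.
    by_container = {}
    for container_id, chunk_id, offset in recipe:
        by_container.setdefault(container_id, []).append((chunk_id, offset))

    # The forward assembly area: one contiguous buffer for the whole span.
    span_size = max(offset for _, _, offset in recipe) + CHUNK_SIZE
    area = bytearray(span_size)

    for container_id, placements in by_container.items():
        chunks = read_container(container_id)        # single read per container
        for chunk_id, offset in placements:
            area[offset:offset + CHUNK_SIZE] = chunks[chunk_id]
    return bytes(area)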


An Optimization of Backup Storage using Backup History and Cache Knowledge in reducing Data Fragmentation for In_line deduplication in Distributed

The chunks of data generated during a backup are physically scattered after deduplication in the backup system, which creates a problem known as fragmentation. Fragmentation mainly manifests as sparse and out-of-order containers. Sparse containers adversely affect restore performance and garbage collection efficiency, while out-of-order contain...




Publication date: 2016