Technical Report

【Abstract】

We present sparse indexing, a technique that uses sampling and exploits the inherent locality within backup streams to solve, for large-scale backup (e.g., hundreds of terabytes), the chunk-lookup disk bottleneck problem that inline, chunk-based deduplication schemes face. The problem is that these schemes traditionally require a full chunk index, which indexes every chunk, in order to determine which chunks have already been stored; unfortunately, at scale it is impractical to keep such an index in RAM, and a disk-based index with one seek per incoming chunk is far too slow. We perform stream deduplication by breaking up an incoming stream into relatively large segments and deduplicating each segment against only a few of the most similar previous segments. To identify similar segments, we use sampling and a sparse index. We choose a small portion of the chunks in the stream as samples; our sparse index maps these samples to the existing segments in which they occur. Thus, we avoid the need for a full chunk index. Since only the sampled chunks' hashes are kept in RAM and the sampling rate is low, we dramatically reduce the RAM-to-disk ratio for effective deduplication. At the same time, only a few seeks are required per segment, so the chunk-lookup disk bottleneck is avoided. Sparse indexing has recently been incorporated into a number of Hewlett-Packard backup products.
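
The flow described in the abstract (sample a few chunk hashes, look them up in a sparse index, pick the most similar earlier segments, and deduplicate only against those) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the class name `SparseIndexStore`, the constants `SAMPLE_BITS` and `MAX_CHAMPIONS`, the use of SHA-1, and the in-memory dictionaries standing in for on-disk segment manifests are all assumptions made for illustration.

```python
import hashlib
from collections import Counter

SAMPLE_BITS = 2      # hypothetical: sample hashes whose low 2 bits are zero (about 1 in 4)
MAX_CHAMPIONS = 2    # hypothetical: deduplicate against at most 2 similar segments


def chunk_hash(chunk: bytes) -> bytes:
    """Hash a chunk; SHA-1 is an assumption of this sketch."""
    return hashlib.sha1(chunk).digest()


def is_sample(h: bytes) -> bool:
    """Select a small, content-determined subset of hashes as samples."""
    return int.from_bytes(h, "big") % (1 << SAMPLE_BITS) == 0


class SparseIndexStore:
    def __init__(self):
        self.sparse_index = {}   # sample hash -> ids of segments containing it (RAM)
        self.manifests = {}      # segment id -> chunk hashes (stands in for disk)
        self.stored_chunks = []  # chunk data actually written

    def store_segment(self, chunks):
        """Deduplicate one segment; return how many chunks were newly stored."""
        hashes = [chunk_hash(c) for c in chunks]
        samples = {h for h in hashes if is_sample(h)}

        # Each shared sample is one vote for the earlier segment it occurs in.
        votes = Counter()
        for h in samples:
            for seg_id in self.sparse_index.get(h, []):
                votes[seg_id] += 1
        champions = [seg_id for seg_id, _ in votes.most_common(MAX_CHAMPIONS)]

        # Load only the champions' manifests (a few "seeks" per segment).
        known = set()
        for seg_id in champions:
            known.update(self.manifests[seg_id])

        # Store chunks not found in the champions; there is no full chunk
        # index, so a duplicate held only by a non-champion may be stored again.
        new_count = 0
        for h, c in zip(hashes, chunks):
            if h not in known:
                self.stored_chunks.append(c)
                known.add(h)
                new_count += 1

        # Record the new segment and add its samples to the sparse index.
        seg_id = len(self.manifests)
        self.manifests[seg_id] = hashes
        for h in samples:
            self.sparse_index.setdefault(h, []).append(seg_id)
        return new_count


if __name__ == "__main__":
    store = SparseIndexStore()
    first = [f"chunk-{i}".encode() for i in range(200)]
    second = first[:190] + [f"new-{i}".encode() for i in range(10)]
    print(store.store_segment(first))   # newly stored chunks in the first segment
    print(store.store_segment(second))  # newly stored chunks in the similar segment
```

In this sketch only the sampled hashes live in `sparse_index`, so its size grows with the sampling rate rather than with the total number of chunks. The trade-off is that deduplication is approximate: a segment that shares no sampled chunk with any earlier segment is stored in full.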

【Preview】

Attachments
Files	Size	Format	View
RO201804100001296LZ	429KB	PDF	download


Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality

Lillibridge, Mark ; Eshghi, Kave ; Bhagwat, Deepavali ; Deolalikar, Vinay ; Trezise, Greg ; Camble, Peter
HP Development Company
Keywords: deduplication; storage; scaling; locality; inline deduplication; sparse index; chunking
RP-ID: HPL-2009-122
Subject classification: Computer Science (General)
United States | English
Source: HP Labs


	Document metrics
	Downloads: 34	Views: 53
