科技报告详细信息
Sparse Indexing: Large Scale, Inline Deduplication Using Sampling and Locality
Lillibridge, Mark ; Eshghi, Kave ; Bhagwat, Deepavali ; Deolalikar, Vinay ; Trezise, Greg ; Camble, Peter
HP Development Company
关键词: deduplication;    storage;    scaling;    locality;    inline deduplication;    sparse index;    chunking;   
RP-ID  :  HPL-2009-122
学科分类:计算机科学(综合)
美国|英语
来源: HP Labs
PDF
【 摘 要 】

We present sparse indexing, a technique that uses sampling and exploits the inherent locality within backup streams to solve for large-scale backup (e.g., hundreds of terabytes) the chunk-lookup disk bottleneck problem that inline, chunk-based deduplication schemes face. The problem is that these schemes traditionally require a full chunk index, which indexes every chunk, in order to determine which chunks have already been stored; unfortunately, at scale it is impractical to keep such an index in RAM and a disk-based index with one seek per incoming chunk is far too slow. We perform stream deduplication by breaking up an incoming stream into relatively large segments and deduplicating each segment against only a few of the most similar previous segments. To identify similar segments, we use sampling and a sparse index. We choose a small portion of the chunks in the stream as samples; our sparse index maps these samples to the existing segments in which they occur. Thus, we avoid the need for a full chunk index. Since only the sampled chunks' hashes are kept in RAM and the sampling rate is low, we dramatically reduce the RAM to disk ratio for effective deduplication. At the same time, only a few seeks are required per segment so the chunk-lookup disk bottleneck is avoided. Sparse indexing has recently been incorporated into number of Hewlett-Packard backup products.

【 预 览 】
附件列表
Files Size Format View
RO201804100001296LZ 429KB PDF download
  文献评价指标  
  下载次数:28次 浏览次数:52次