Technical Report Details
Cruz: Application-Transparent Distributed Checkpoint-Restart on Standard Operating Systems
Janakiraman, G. (John) ; Santos, Jose Renato ; Subhraveti, Dinesh ; Turner, Yoshio
HP Development Company
Keywords: checkpointing; snapshot; process migration; error recovery; availability
RP-ID: HPL-2005-66
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
【Abstract】
We present a new distributed checkpoint-restart mechanism, Cruz, that works without requiring application, library, or base kernel modifications. This mechanism provides comprehensive support for checkpointing and restoring application state, both at user level and within the OS. Our implementation builds on Zap, a process migration mechanism, implemented as a Linux kernel module, which operates by interposing a thin layer between applications and the OS. In particular, we enable support for networked applications by adding migratable IP and MAC addresses, and checkpoint-restart of socket buffer state, socket options, and TCP state. We leverage this capability to devise a novel method for coordinated checkpoint-restart that is simpler than prior approaches. For instance, it eliminates the need to flush communication channels by exploiting the packet re-transmission behavior of TCP and existing OS support for packet filtering. Our experiments show that the overhead of coordinating checkpoint-restart is negligible, demonstrating the scalability of this approach.
Notes: Copyright IEEE. To be published in and presented at The International Conference on Dependable Systems and Networks (DSN-2005), 28 June - 1 July 2005, Yokohama, Japan. 10 pages.
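The abstract's claim that channels need not be flushed can be illustrated with a toy model. The sketch below is not the Cruz implementation; the `Endpoint` class, the `transmit` helper, and the `demo` scenario are hypothetical names invented for illustration, and the protocol details (block traffic, snapshot every node, unblock) are an assumption about how packet filtering and TCP-style retransmission might combine. It models a sender that keeps unacknowledged data, a filter that silently drops packets while nodes are checkpointed, and a restart in which the dropped data is simply resent.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Endpoint:
    """Toy TCP-like endpoint (hypothetical, for illustration only)."""
    next_seq: int = 0                              # next sequence number to send
    unacked: dict = field(default_factory=dict)    # seq -> payload awaiting ACK
    expected: int = 0                              # next in-order seq to accept
    delivered: list = field(default_factory=list)  # payloads handed to the app
    filtered: bool = False                         # True while a filter drops traffic

    def send(self, payload):
        """Queue a payload; it stays in `unacked` until acknowledged."""
        self.unacked[self.next_seq] = payload
        self.next_seq += 1


def transmit(src, dst):
    """(Re)transmit all unacknowledged data from src to dst."""
    if src.filtered or dst.filtered:
        return  # the packet filter silently drops everything
    for seq in sorted(src.unacked):
        if seq == dst.expected:  # accept only the next in-order segment
            dst.delivered.append(src.unacked[seq])
            dst.expected += 1
    # Cumulative ACK: discard everything the receiver has already accepted.
    for seq in [s for s in src.unacked if s < dst.expected]:
        del src.unacked[seq]


def demo():
    a, b = Endpoint(), Endpoint()
    a.send("m0")
    transmit(a, b)                  # m0 is delivered and acknowledged
    a.send("m1")                    # m1 is queued but not yet delivered

    # Assumed coordinated checkpoint: block traffic, snapshot, unblock.
    a.filtered = b.filtered = True
    transmit(a, b)                  # dropped by the filter; no flush is attempted
    image_a, image_b = copy.deepcopy((a, b))
    a.filtered = b.filtered = False

    # Simulated failure and restart from the saved images.
    image_a.filtered = image_b.filtered = False
    transmit(image_a, image_b)      # retransmission recovers the dropped m1
    return image_b.delivered, image_a.unacked


if __name__ == "__main__":
    print(demo())
```

In this model the checkpoint is consistent because the sender's unacknowledged buffer is part of the saved state, so any packet dropped by the filter is still recoverable after restart.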
Attachments
Files Size Format View
RO201804100000882LZ 184KB PDF download