Thesis details
Rebound: Scalable checkpointing for coherent shared memory
Agarwal, Rishi; Torrellas, Josep
Keywords: Scalable Checkpointing; Shared-Memory Multiprocessors; Faults
Others: https://www.ideals.illinois.edu/bitstream/handle/2142/24039/Agarwal_Rishi.pdf?sequence=1&isAllowed=y
United States | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
【 Abstract 】

As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rolling back sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
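
The idea of building checkpoint and rollback operations around dynamic groups of communicating processors can be illustrated with a small software model. This is only a sketch, not Rebound's hardware mechanism: the class name `DependenceTracker`, its methods, and the per-address `last_writer` map are hypothetical stand-ins for what a directory could observe. It also assumes a simplified rule that the abstract does not spell out: a processor that checkpoints pulls in, transitively, the processors whose data it has read, and a processor that rolls back pulls in, transitively, the processors that have read its data. Only read-after-write communication is modeled.

```python
from collections import defaultdict


class DependenceTracker:
    """Toy model of inter-thread dependence tracking (all names hypothetical)."""

    def __init__(self):
        self.last_writer = {}               # address -> id of the processor that last wrote it
        self.producers = defaultdict(set)   # consumer id -> processors it has read data from
        self.consumers = defaultdict(set)   # producer id -> processors that have read its data

    def write(self, proc, addr):
        # A directory would see this as an ownership request for the line.
        self.last_writer[addr] = proc

    def read(self, proc, addr):
        # Reading data last written by another processor creates a dependence.
        writer = self.last_writer.get(addr)
        if writer is not None and writer != proc:
            self.producers[proc].add(writer)
            self.consumers[writer].add(proc)

    @staticmethod
    def _closure(start, edges):
        # Transitive closure over the given edge map, starting at `start`.
        seen, stack = {start}, [start]
        while stack:
            for nxt in edges.get(stack.pop(), ()):
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append(nxt)
        return seen

    def checkpoint_set(self, proc):
        # Assumed rule: a checkpointing processor includes its transitive producers.
        return self._closure(proc, self.producers)

    def rollback_set(self, proc):
        # Assumed rule: a processor that rolls back includes its transitive consumers.
        return self._closure(proc, self.consumers)


if __name__ == "__main__":
    t = DependenceTracker()
    t.write(0, "x"); t.read(1, "x")   # processor 1 consumes data from processor 0
    t.write(1, "y"); t.read(2, "y")   # processor 2 consumes data from processor 1
    t.write(3, "z")                   # processor 3 communicates with nobody
    print(sorted(t.checkpoint_set(2)))  # [0, 1, 2]
    print(sorted(t.rollback_set(0)))    # [0, 1, 2]
    print(sorted(t.checkpoint_set(3)))  # [3]
```

In this toy run, processor 3 never communicates, so it checkpoints and rolls back alone. That is the property a global scheme lacks, since it would involve all four processors in every operation.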

Attachments
File: Rebound: Scalable checkpointing for coherent shared memory
Size: 277KB
Format: PDF