As we move to large manycores, the hardware-based global checkpointing schemes that have been proposed for small shared-memory machines do not scale. Scalability barriers include global operations, work lost to global rollback, and inefficiencies in imbalanced or I/O-intensive loads. Scalable checkpointing requires tracking inter-thread dependences and building the checkpoint and rollback operations around dynamic groups of communicating processors. To address this problem, this paper introduces Rebound, the first hardware-based scheme for coordinated local checkpointing in multiprocessors with directory-based cache coherence. Rebound leverages the transactions of a directory protocol to track inter-thread dependences. In addition, it boosts checkpointing efficiency by: (i) delaying the writeback of data to safe memory at checkpoints, (ii) supporting operation with multiple checkpoints, and (iii) optimizing checkpointing at barrier synchronization. Finally, Rebound introduces distributed algorithms for checkpointing and rollback sets of processors. Simulations of parallel programs with up to 64 threads show that Rebound is scalable and has very low overhead. For 64 processors, its average performance overhead is only 2%, compared to 15% for global checkpointing.
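The idea of checkpointing only a dynamic group of communicating processors can be illustrated with a small software model. The sketch below is not Rebound's hardware mechanism or its distributed algorithms. It assumes a simplified rule in which any two processors that exchanged data since their last checkpoint belong to the same group, and the names `DependenceTracker`, `record_transfer`, `group_of`, and `checkpoint` are hypothetical.

```python
from collections import defaultdict, deque


class DependenceTracker:
    """Toy model: remembers which processors communicated since their last checkpoint."""

    def __init__(self):
        # Undirected dependence edges between processor ids (an assumed simplification).
        self.neighbors = defaultdict(set)

    def record_transfer(self, producer, consumer):
        """Record that data written by `producer` was read by `consumer`.

        In a real system this information would come from coherence
        transactions; here it is an explicit call.
        """
        if producer != consumer:
            self.neighbors[producer].add(consumer)
            self.neighbors[consumer].add(producer)

    def group_of(self, initiator):
        """Return every processor reachable from `initiator` through recorded transfers."""
        seen = {initiator}
        queue = deque([initiator])
        while queue:
            p = queue.popleft()
            for q in self.neighbors[p]:
                if q not in seen:
                    seen.add(q)
                    queue.append(q)
        return seen

    def checkpoint(self, initiator):
        """Checkpoint the initiator's group only, then clear that group's dependences."""
        group = self.group_of(initiator)
        for p in group:
            # The group is a connected component, so all edges of p stay inside it.
            self.neighbors.pop(p, None)
        return group


tracker = DependenceTracker()
tracker.record_transfer(0, 1)
tracker.record_transfer(1, 2)
tracker.record_transfer(5, 6)

print(sorted(tracker.checkpoint(0)))  # [0, 1, 2]
print(sorted(tracker.group_of(5)))    # [5, 6]
print(sorted(tracker.group_of(0)))    # [0]
```

In this model, a checkpoint started by processor 0 involves only processors 0, 1, and 2. Processors 5 and 6 keep running undisturbed, which is the contrast with a global scheme where every processor would take part.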
Rebound: Scalable checkpointing for coherent shared memory