An important set of challenges emerges as the High Performance Computing (HPC) community aims to reach extreme scale. Resilience and energy consumption are two of those challenges. Extreme-scale machines are expected to have a high failure frequency. This is an inevitable consequence of the mismatch between two trends: the number of components assembled in supercomputers grows exponentially, whereas the reliability of each individual component improves much more slowly. At the same time, the vast number of components in a single machine will consume a non-trivial amount of energy. To keep a supercomputer within operational margins, HPC systems have to be both reliable and energy-aware.

For an application to be able to run and make progress in spite of constant interruptions, it has to incorporate some form of fault tolerance. Rollback-recovery techniques provide a framework to overcome crashes in the system by periodically saving the state of the application and rolling back to checkpoints in case of failures. Two well-known rollback-recovery techniques are checkpoint/restart and message-logging. The former is easier to implement and has become the de facto standard to make applications fault tolerant. It has, however, a high performance and energy cost during recovery. Message-logging, on the other hand, makes it possible to recover faster from a failure and to consume less energy. The downside of message-logging is the overhead it exhibits in the failure-free scenario. Memory and performance overheads may offset its advantages.

This thesis focuses on techniques to alleviate the downsides of message-logging. It presents a mechanism based on high-level programming language constructs to decrease the performance overhead of message-logging. It also introduces two strategies to reduce the memory overhead created by the message log. Additionally, it addresses important architectural constraints of modern supercomputers. Based on large-scale experimental results and projections from an analytical model, we conclude that message-logging is a promising strategy to provide fault tolerance at a low energy cost for extreme-scale machines.
Scalable message-logging techniques for effective fault tolerance in HPC applications