Technical Report Details
Increasing fault resiliency in a message-passing environment.
Stearley, Jon R. ; Riesen, Rolf E. ; Laros, James H., III ; Ferreira, Kurt Brian ; Pedretti, Kevin Thomas Tauke ; Oldfield, Ron A. ; Kordenbrock, Todd (Hewlett-Packard Company) ; Brightwell, Ronald Brian
Sandia National Laboratories
Keywords: Errors; Computer Networks; Reliability; Performance; 97 Mathematical Methods And Computing
DOI  :  10.2172/1001015
RP-ID  :  SAND2009-6753
RP-ID  :  AC04-94AL85000
RP-ID  :  1001015
United States | English
Source: UNT Digital Library
【 Abstract 】

Petaflops systems will have tens to hundreds of thousands of compute nodes, which increases the likelihood of faults. Applications use checkpoint/restart to recover from these faults, but even under ideal conditions, applications running on more than 30,000 nodes will likely spend more than half of their total run time saving checkpoints, restarting, and redoing work that was lost. We created a library that performs redundant computations on additional nodes allocated to the application. An active node and its redundant partner form a node bundle, which fails and causes an application restart only when both nodes in the bundle fail. The goal of this work is to learn whether this can be done entirely at the user level, what requirements the library places on a Reliability, Availability, and Serviceability (RAS) system, and what its impact on performance and run time is. We find that our redundant MPI layer library imposes a relatively modest performance penalty on applications, but that it greatly reduces the number of application interrupts. This reduction in interrupts leads to large savings in restart and rework time. For large-scale applications, the savings compensate for the performance loss and the additional nodes required for redundant computations.
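
The node-bundle idea in the abstract can be illustrated with a small probability sketch. This is not the report's model or code: it assumes, purely for illustration, that every node fails independently with the same probability during some fixed interval, and the function names and the example failure probability are hypothetical. Under those assumptions a bundle fails only when both of its nodes fail, so its failure probability is the square of the single-node probability.

```python
def interrupt_probability(num_units: int, unit_failure_prob: float) -> float:
    """Probability that at least one of num_units independent units fails."""
    return 1.0 - (1.0 - unit_failure_prob) ** num_units


def compare_redundancy(num_active_nodes: int, node_failure_prob: float):
    """Return (without_redundancy, with_redundancy) interrupt probabilities.

    Assumption: node failures are independent and identically likely.
    """
    # Without redundancy, any single node failure interrupts the application.
    without = interrupt_probability(num_active_nodes, node_failure_prob)

    # With redundancy, a bundle fails only if both of its nodes fail.
    bundle_failure_prob = node_failure_prob ** 2
    with_redundancy = interrupt_probability(num_active_nodes, bundle_failure_prob)
    return without, with_redundancy


if __name__ == "__main__":
    # Hypothetical numbers: 30,000 active nodes, 1e-4 failure chance per node.
    without, with_redundancy = compare_redundancy(30_000, 1e-4)
    print(f"interrupt probability without redundancy: {without:.4f}")
    print(f"interrupt probability with redundancy:    {with_redundancy:.6f}")
```

The sketch ignores the costs the abstract mentions (doubling the node count and the performance penalty of the redundant MPI layer); it only shows why pairing nodes sharply reduces the chance of an application interrupt.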
