Technical Report

【Abstract】

Petaflops systems will have tens to hundreds of thousands of compute nodes, which increases the likelihood of faults. Applications use checkpoint/restart to recover from these faults, but even under ideal conditions, applications running on more than 30,000 nodes will likely spend more than half of their total run time saving checkpoints, restarting, and redoing work that was lost. We created a library that performs redundant computations on additional nodes allocated to the application. An active node and its redundant partner form a node bundle, which fails and causes an application restart only when both nodes in the bundle fail. The goal of this library is to learn whether this can be done entirely at the user level, what requirements this library places on a Reliability, Availability, and Serviceability (RAS) system, and what its impact on performance and run time is. We find that our redundant MPI layer library imposes a relatively modest performance penalty on applications, but that it greatly reduces the number of application interrupts. This reduction in interrupts leads to huge savings in restart and rework time. For large-scale applications, the savings compensate for the performance loss and the additional nodes required for redundant computations.
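
The node-bundle idea in the abstract can be illustrated with a small reliability calculation. The sketch below is not the report's model. It assumes independent, exponentially distributed node failures, no replacement of failed nodes before a restart, and a per-node MTBF of five years chosen purely for illustration. The function names are hypothetical. It compares the expected time to the first application interrupt without redundancy (the first node failure) against the bundled case (the first bundle to lose both of its nodes).

```python
import math


def mean_time_to_interrupt_plain(n_nodes, node_mtbf):
    """Expected time until the first of n independent nodes fails.

    The minimum of n exponential lifetimes with mean node_mtbf is
    exponential with mean node_mtbf / n.
    """
    return node_mtbf / n_nodes


def bundle_survival(t, n_bundles, node_mtbf):
    """Probability that no bundle has lost both of its nodes by time t."""
    p_node_failed = 1.0 - math.exp(-t / node_mtbf)
    p_bundle_alive = 1.0 - p_node_failed ** 2  # a bundle dies only if both nodes die
    return p_bundle_alive ** n_bundles


def mean_time_to_interrupt_bundled(n_bundles, node_mtbf, steps=20_000):
    """Expected time until the first bundle fails.

    Integrates the survival function with the trapezoid rule. The
    integration horizon is a heuristic that places the upper limit far
    into the tail, where the survival probability is negligible.
    """
    horizon = 20.0 * node_mtbf / math.sqrt(n_bundles)
    dt = horizon / steps
    total = 0.5 * (bundle_survival(0.0, n_bundles, node_mtbf)
                   + bundle_survival(horizon, n_bundles, node_mtbf))
    for i in range(1, steps):
        total += bundle_survival(i * dt, n_bundles, node_mtbf)
    return total * dt


if __name__ == "__main__":
    node_mtbf_hours = 5 * 365 * 24  # assumed value, not taken from the report
    n = 30_000                      # active nodes (and bundles in the redundant case)
    plain = mean_time_to_interrupt_plain(n, node_mtbf_hours)
    bundled = mean_time_to_interrupt_bundled(n, node_mtbf_hours)
    print(f"no redundancy: {plain:.2f} hours between interrupts")
    print(f"node bundles:  {bundled:.2f} hours between interrupts")
```

Under these assumptions the bundled configuration uses twice as many nodes (30,000 bundles means 60,000 nodes), so the comparison shows only the reduction in interrupts, not the net run-time effect that the report evaluates.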

【Preview】

Attachments
Files	Size	Format
1001015.pdf	9679KB	PDF


Increasing fault resiliency in a message-passing environment.

Stearley, Jon R. ; Riesen, Rolf E. ; Laros, James H., III ; Ferreira, Kurt Brian ; Pedretti, Kevin Thomas Tauke ; Oldfield, Ron A. ; Kordenbrock, Todd (Hewlett-Packard Company) ; Brightwell, Ronald Brian
Sandia National Laboratories
Keywords: Errors; Computer Networks; Reliability; Performance; 97 Mathematical Methods And Computing
DOI : 10.2172/1001015 RP-ID : SAND2009-6753 RP-ID : AC04-94AL85000 RP-ID : 1001015
United States | English
Source: UNT Digital Library


