Technical Report Details
rMPI : increasing fault resiliency in a message-passing environment.
Stearley, Jon R. ; Laros, James H., III ; Ferreira, Kurt Brian ; Pedretti, Kevin Thomas Tauke ; Oldfield, Ron A. ; Riesen, Rolf (IBM Research, Ireland) ; Brightwell, Ronald Brian
Keywords: COMPUTERS; DESIGN; RELIABILITY; REPLICAS; TOLERANCE
DOI  :  10.2172/1012733
RP-ID  :  SAND2011-2488
PID  :  OSTI ID: 1012733
Others  :  TRN: US201110%%453
Subject classification: Social sciences, humanities and arts (general)
United States | English
Source: SciTech Connect
【 Abstract 】

As High-End Computing machines continue to grow in size, issues such as fault tolerance and reliability limit application scalability. Current techniques to ensure progress across faults, like checkpoint-restart, are unsuitable at these scales due to excessive overheads, which are predicted to more than double an application's time to solution. Redundant computation, long used in distributed and mission-critical systems, has been suggested as an alternative to checkpoint-restart on its own. In this paper we describe the rMPI library, which enables portable and transparent redundant computation for MPI applications. We detail the design of the library as well as two replica consistency protocols, outline the overheads of this library at scale on a number of real-world applications, and finally outline the significant increase in an application's time to solution at extreme scale and show the scenarios in which redundant computation makes sense.
