Technical Report Details
Cooperative application/OS DRAM fault recovery.
Ferreira, Kurt Brian ; Bridges, Patrick G. (University of New Mexico, Albuquerque, NM) ; Heroux, Michael Allen ; Hoemmen, Mark ; Brightwell, Ronald Brian
Sandia National Laboratories
Keywords: Computers; Convergence; Programming; Computer Codes; 99 General And Miscellaneous//Mathematics, Computing, And Information Science
DOI: 10.2172/1044954
RP-ID: SAND2012-4059
RP-ID: AC04-94AL85000
RP-ID: 1044954
United States | English
Source: UNT Digital Library
【 Abstract 】
Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application/OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.
Attachments
Files Size Format View
1044954.pdf 389KB PDF download