Technical Report Details
Cooperative application/OS DRAM fault recovery.
Ferreira, Kurt Brian ; Bridges, Patrick G. (University of New Mexico, Albuquerque, NM) ; Heroux, Michael Allen ; Hoemmen, Mark ; Brightwell, Ronald Brian
Sandia National Laboratories
Keywords: Computers; Convergence; Programming; Computer Codes; 99 General And Miscellaneous//Mathematics, Computing, And Information Science
DOI: 10.2172/1044954; RP-ID: SAND2012-4059; RP-ID: AC04-94AL85000; RP-ID: 1044954
United States | English
Source: UNT Digital Library
[Abstract]
Exascale systems will present considerable fault-tolerance challenges to applications and system software. These systems are expected to suffer several hard and soft errors per day. Unfortunately, many fault-tolerance methods in use, such as rollback recovery, are unsuitable for many expected errors, for example DRAM failures. As a result, applications will need to address these resilience challenges to more effectively utilize future systems. In this paper, we describe work on a cross-layer application/OS framework to handle uncorrected memory errors. We illustrate the use of this framework through its integration with a new fault-tolerant iterative solver within the Trilinos library, and present initial convergence results.

[Preview]
Files | Size | Format |
---|---|---|
1044954.pdf | 389KB | PDF |