Technical Report Details
A Framework For Evaluating Comprehensive Fault Resilience Mechanisms In Numerical Programs
Chen, S. [1]; Peng, L. [1]; Bronevetsky, G. [1]
[1] Lawrence Livermore National Lab. (LLNL), Livermore, CA (United States)
Keywords: Soft faults; High-performance computing; Numerical errors; Fault resilience
DOI  :  10.2172/1179432
RP-ID  :  LLNL-SR--666073
PID  :  OSTI ID: 1179432
Subject classification: Mathematics (general)
United States | English
Source: SciTech Connect
【 Abstract 】

As HPC systems approach Exascale, their circuit feature sizes will shrink while their overall size grows, all under a fixed power limit. These trends imply that soft faults in electronic circuits will become an increasingly significant problem for applications that run on these systems, causing them to occasionally crash or, worse, silently return incorrect results. This is motivating extensive work on application resilience to such faults, ranging from generic techniques such as replication or checkpoint/restart to algorithm-specific error detection and resilience techniques. Effective use of such techniques requires a detailed understanding of (1) which vulnerable parts of the application are most worth protecting and (2) the performance and resilience impact of fault resilience mechanisms on the application. This paper presents FaultTelescope, a tool that combines these two analyses and generates actionable insights by intuitively presenting both application vulnerabilities and the impact of fault resilience mechanisms on applications.
