Technical Report Details
An Exploration in Implementing Fault Tolerance in Scientific Simulation Application Software
DRAKE, RICHARD R. ; SUMMERS, RANDALL M.
Sandia National Laboratories
Keywords: Fault Tolerant Computers; Computerized Simulation; Errors; Computer Codes; 99 General And Miscellaneous//Mathematics, Computing, And Information Science
DOI  :  10.2172/811162
RP-ID  :  SAND2003-1651
RP-ID  :  AC04-94AL85000
RP-ID  :  811162
United States | English
Source: UNT Digital Library
【 Abstract 】

The ability for scientific simulation software to detect and recover from errors and failures of supporting hardware and software layers is becoming more important due to the pressure to shift from large, specialized multi-million dollar ASCI computing platforms to smaller, less expensive interconnected machines consisting of off-the-shelf hardware. As evidenced by the CPlant™ experiences, fault tolerance can be necessary even on such a homogeneous system and may also prove useful in the next generation of ASCI platforms. This report describes a research effort intended to study, implement, and test the feasibility of various fault tolerance mechanisms controlled at the simulation code level. Errors and failures would be detected by underlying software layers, communicated to the application through a convenient interface, and then handled by the simulation code itself. Targeted faults included corrupt communication messages, processor node dropouts, and unacceptable slowdown of service from processing nodes. Recovery techniques such as re-sending communication messages and dynamic reallocation of failing processor nodes were considered. However, most fault tolerance mechanisms rely on underlying software layers which were discovered to be lacking to such a degree that mechanisms at the application level could not be implemented. This research effort has been postponed and shifted to these supporting layers.

【 Preview 】
Attachments
Files Size Format View
811162.pdf 1339KB PDF download