Dissertation details
From experiment to design – fault characterization and detection in parallel computer systems using computational accelerators
Yim, Keun Soo
Keywords: Fault tolerance system design; Experimental validation; Error detection; Fault injection; Measurement-based co-design; Graphics Processing Unit fault tolerance; Message Passing Interface; CPU-GPU hybrid computers; COTS-based mission-critical systems; Reliability; Dependability
Others  :  https://www.ideals.illinois.edu/bitstream/handle/2142/44390/Keun%20Soo_Yim.pdf?sequence=1&isAllowed=y
United States | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
[ Abstract ]

This dissertation summarizes experimental validation and co-design studies conducted to optimize the fault detection capabilities and overheads in hybrid computer systems (e.g., using CPUs and Graphics Processing Units, or GPUs), and consequently to improve the scalability of parallel computer systems using computational accelerators. The experimental validation studies were conducted to help us understand the failure characteristics of CPU-GPU hybrid computer systems under various types of hardware faults. The main characterization targets were faults that are difficult to detect and/or recover from, e.g., faults that cause long-latency failures (Ch. 3), faults in dynamically allocated resources (Ch. 4), faults in GPUs (Ch. 5), faults in MPI programs (Ch. 6), and microarchitecture-level faults with specific timing features (Ch. 7).

The co-design studies were based on the characterization results. One of the co-designed systems has a set of source-to-source translators that customize and strategically place error detectors in the source code of target GPU programs (Ch. 5). Another co-designed system uses an extension card to learn the normal behavioral and semantic execution patterns of message-passing processes executing on CPUs, and to detect abnormal behaviors of those parallel processes (Ch. 6). The third co-designed system is a co-processor that has a set of new instructions to support software-implemented fault detection techniques (Ch. 7).

The work described in this dissertation gains importance because heterogeneous processors have become an essential component of state-of-the-art supercomputers. GPUs were used in three of the five fastest supercomputers that were operating in 2011. Our work included comprehensive fault characterization studies in CPU-GPU hybrid computers. In CPUs, we monitored the target systems for a long period of time after injecting faults (a temporally comprehensive experiment), and injected faults into various types of program states that included dynamically allocated memory (to be spatially comprehensive). In GPUs, we used fault injection studies to demonstrate the importance of detecting silent data corruption (SDC) errors that are mainly due to the lack of fine-grained protections and the massive use of fault-insensitive data. This dissertation also presents transparent fault tolerance frameworks and techniques that are directly applicable to hybrid computers built using only commercial off-the-shelf hardware components.

This dissertation shows that by developing an understanding of the failure characteristics and error propagation paths of target programs, we were able to create fault tolerance frameworks and techniques that can quickly detect and recover from hardware faults with low performance and hardware overheads.

[ Preview ]
Attachment list
Files Size Format View
From experiment to design – fault characterization and detection in parallel computer systems using computational accelerators 3130KB PDF download
Document metrics
Downloads: 20  Views: 14