Dissertation Details
Transparent Fault Tolerance for Job Healing in HPC Environments
Wang, Chao ; Dr. Frank Mueller, Committee Chair ; Dr. Xiaosong Ma, Committee Member ; Dr. Yan Solihin, Committee Member ; Dr. Nagiza Samatova, Committee Member
University: North Carolina State University
Keywords: job input data; fault tolerance; high-performance computing; fault resilience; checkpoint/restart
Others  :  https://repository.lib.ncsu.edu/bitstream/handle/1840.16/4437/etd.pdf?sequence=1&isAllowed=y
United States | English
【 Abstract 】

As the number of nodes in high-performance computing environments keeps increasing, faults are becoming commonplace, causing losses in intermediate results of HPC jobs. Furthermore, storage systems providing job input data have been shown to consistently rank as the primary source of system failures, leading to data unavailability and job resubmissions.

This dissertation presents a combination of multiple fault tolerance techniques that realize significant advances in fault resilience of HPC jobs. The efforts encompass two broad areas.

First, at the job level, novel, scalable mechanisms are built in support of proactive fault tolerance (FT) and to significantly enhance reactive FT. The contributions of this dissertation in this area are (1) a transparent job pause mechanism, which allows a job to pause when a process fails and prevents it from having to re-enter the job queue; (2) a proactive fault-tolerant approach that combines process-level live migration with health monitoring to complement reactive with proactive FT and to reduce the number of checkpoints when a majority of the faults can be handled proactively; (3) a novel back migration approach to eliminate load imbalance or bottlenecks caused by migrated tasks; and (4) an incremental checkpointing mechanism, which is combined with full checkpoints to explore the potential of reducing the overhead of checkpointing by performing fewer full checkpoints interspersed with multiple smaller incremental checkpoints.

Second, for the job input data, transparent techniques are provided to improve the reliability, availability and performance of HPC I/O systems. In this area, the dissertation contributes (1) a mechanism for offline job input data reconstruction to ensure availability of job input data and to improve center-wide performance at no cost to job owners; (2) an approach to automatically recover job input data at run-time during failures by recovering staged data from an original source; and (3) “just in time” replication of job input data so as to maximize the use of supercomputer cycles.

Experimental results demonstrate the value of these advanced fault tolerance techniques to increase fault resilience in HPC environments.

【 Preview 】
Attachments
Files Size Format View
Transparent Fault Tolerance for Job Healing in HPC Environments 1811KB PDF download
Document metrics
Downloads: 24    Views: 27