Thesis details
Analysis of Gemini interconnect recovery mechanisms: methods and observations
Jha, Saurabh ; Iyer, Ravishankar K.
Keywords: High Performance Computing; Fault Tolerance; Interconnects
Others: https://www.ideals.illinois.edu/bitstream/handle/2142/95450/JHA-THESIS-2016.pdf?sequence=1&isAllowed=y
United States | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
【 Abstract 】

This thesis focuses on the resilience of network components and the recovery capabilities of extreme-scale high-performance computing (HPC) systems, specifically petaflop-level supercomputers, which are aimed at solving complex science, engineering, and business problems that require high bandwidth, enhanced networking, and high compute capabilities. The resilience of the network is critical for ensuring successful execution of applications and overall system availability. Failures of interconnect components such as links, routers, and power supplies pose a threat to the resilience of the interconnect network, causing application failures and, in the worst case, system-wide failure. An extreme-scale system is designed to manage these failures and recover from them automatically to ensure successful application execution and avoid system-wide failure. Thus, in this thesis, we characterize the success probability of the recovery procedures as well as their impact on applications. We developed an interconnect recovery mechanisms analysis tool (I-RAT), a plugin built on top of LogDiver, to characterize and assess the impact of recovery mechanisms. The tool was used to analyze more than two years of network/system logs from Blue Waters, a supercomputer operated by the NCSA at the University of Illinois.
Our analyses show that interconnect recovery mechanisms are frequently triggered (the mean time between triggers is as short as 36 hours for link failovers) and that the initiated recovery fails with relatively high probability (as much as 0.25 for link failover). Furthermore, the analyses show that system resilience does not equate to application resilience, since executing applications can fail with non-negligible probability during (or just after) a successful recovery. We also show that as many as 20% of the executing applications fail during the recovery phase.

【 Preview 】
Attachment list
Files | Size | Format | View
Analysis of Gemini interconnect recovery mechanisms: methods and observations | 2046KB | PDF | download