Thesis details
Failure avoidance techniques for HPC systems based on failure prediction
Gainaru, Ana
Keywords: High Performance Computing (HPC); fault tolerance; resiliency; failure prediction; performance degradation
Others: https://www.ideals.illinois.edu/bitstream/handle/2142/88016/GAINARU-DISSERTATION-2015.pdf?sequence=1&isAllowed=y
United States | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
【 Abstract 】
An increasingly large percentage of the computing capacity in today's large high-performance computing systems is wasted due to failures and recoveries. Moreover, high-performance computing is expected to reach exascale within a decade, decreasing the mean time between failures to one day or even a few hours and making fault tolerance a major challenge for the HPC community. As a consequence, current research focuses on fault tolerance strategies that aim to minimize the effects of faults on applications. By far the most popular and widely used techniques in this field are rollback-recovery protocols. However, existing rollback-recovery techniques have severe scalability limitations, and without further optimizations the use of current protocols on future exascale systems is in serious question. One way of reducing the overhead induced by these strategies is to combine them with failure avoidance methods. Failure avoidance is based on a prediction model that detects fault occurrences ahead of time and allows preventive measures to be taken, such as migrating tasks or checkpointing the application before the failure. The same methodology can be generalized and applied to anomaly avoidance, where an anomaly can mean anything from a system failure to performance degradation at the application level. For this, monitoring systems require a reliable prediction system that indicates when failures will occur and at what location. Thus far, research in this field has used ideal predictors that have no implementation in real HPC systems.
This thesis focuses on analyzing and characterizing anomaly patterns at both the application and system levels and on offering solutions to prevent anomalies from affecting applications running in the system. Currently, there is no good characterization of normal behavior for system state data or of how different components react to failures within HPC systems. For example, when a node experiences a network failure and cannot generate log messages, the failure shows up in the log files as a lack of generated messages. Conversely, some component failures may cause a large number of notifications to be logged. For example, memory failures can result in a single faulty component generating hundreds or thousands of messages in less than a day. It is important to capture the behavior of each event type and to understand what the normal behavior is and how each failure type affects it.
This idea is the building block of a novel way of characterizing the state of the system over time by analyzing the properties of each event described in different system metrics, considering its own trend and behavior. The method integrates signal processing concepts with data mining techniques in the context of analysis for large-scale systems. By shaping the normal and faulty behavior of each event and of the whole system, appropriate models and methods for descriptive and forecasting purposes are proposed. After establishing an accurate overview of the whole system, the thesis analyzes how the prediction model impacts current fault tolerance techniques and finally integrates it into a fault avoidance solution. This hybrid protocol reduces the overhead that current fault tolerance strategies impose on applications and presents a viable solution for future large-scale systems.