Brazilian Computer Society. Journal
Running resilient MPI applications on a Dynamic Group of Recommended Processes
article
Edson Tavares de Camargo [1], Elias P. Duarte [1]
[1] Department of Informatics, Federal University of Paraná (UFPR);Federal Technology University of Paraná (UTFPR)
Keywords: Dynamic Group of Recommended Processes (DGRP); Resilience; Fault tolerance; MPI applications; HPC systems
DOI: 10.1186/s13173-018-0069-z
Source: Springer UK
【 Abstract 】

High-performance computing systems run applications that can take several hours to execute and must cope with a potentially large number of faults. Most existing fault-tolerance strategies for these systems assume crash faults, that is, permanent events that are easily detected. This is not the case in several real systems, in particular in shared clusters, where even load variation may cause performance problems that are virtually equivalent to faults. In this work, we present a new model to deal with this problem, in which processes execute tests among themselves to determine whether the processors (or cores) on which they are running are recommended or non-recommended. Processes classified as recommended form a Dynamic Group of Recommended Processes (DGRP) that runs the application. The DGRP is formed only by processes that have not been tested as non-recommended by all DGRP processes. A process outside the DGRP that is continuously tested as recommended can rejoin the DGRP after a round of consensus executed by the DGRP processes. We present experimental results obtained from an MPI-based implementation in which the HyperQuickSort parallel sorting algorithm reconfigures itself at runtime to tolerate up to N − 1 faults (in a system with N processes) while sorting up to 1 billion integers.
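
The membership rule described in the abstract can be illustrated with a small, MPI-free sketch. It is not the authors' implementation. It assumes one reading of the abstract: a member leaves the DGRP only when every other member has tested it as non-recommended, and an outside process rejoins only when every current member tests it as recommended. The second condition is a simple stand-in for the consensus round, whose details the abstract does not give. The names `update_dgrp`, `try_rejoin`, and the `verdicts` table are hypothetical.

```python
from typing import Dict, Set

# verdicts[tester][tested] is True when `tester` currently rates `tested`
# as recommended and False when it rates it as non-recommended.
# This table is a hypothetical stand-in for the tests processes run on each other.
Verdicts = Dict[int, Dict[int, bool]]


def update_dgrp(dgrp: Set[int], verdicts: Verdicts) -> Set[int]:
    """Return the next DGRP under the assumed unanimity rule."""
    next_group = set()
    for p in dgrp:
        testers = [q for q in dgrp if q != p]
        # Assumption: p is dropped only if ALL other members rate it
        # non-recommended; a missing verdict counts as recommended.
        rejected_by_all = bool(testers) and all(
            not verdicts.get(q, {}).get(p, True) for q in testers
        )
        if not rejected_by_all:
            next_group.add(p)
    return next_group


def try_rejoin(dgrp: Set[int], candidate: int, verdicts: Verdicts) -> Set[int]:
    """Stand-in for the consensus round: the candidate rejoins only if
    every current member rates it as recommended."""
    if candidate in dgrp:
        return set(dgrp)
    agreed = bool(dgrp) and all(
        verdicts.get(q, {}).get(candidate, False) for q in dgrp
    )
    return set(dgrp) | {candidate} if agreed else set(dgrp)


if __name__ == "__main__":
    group = {0, 1, 2, 3}
    # Processes 0, 1 and 3 all rate process 2 as non-recommended.
    tests = {
        0: {1: True, 2: False, 3: True},
        1: {0: True, 2: False, 3: True},
        2: {0: True, 1: True, 3: True},
        3: {0: True, 1: True, 2: False},
    }
    group = update_dgrp(group, tests)
    print(sorted(group))
    # Later, all members rate process 2 as recommended again.
    group = try_rejoin(group, 2, {0: {2: True}, 1: {2: True}, 3: {2: True}})
    print(sorted(group))
```

In this sketch a single dissenting tester is enough to keep a process in the group, and a single dissenting member is enough to block a rejoin. A real implementation would run the tests and the agreement step over MPI messages rather than a shared table.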

【 License 】

Unknown   
