Thesis

【Abstract】

Reliability is increasingly becoming a challenge for high-performance computing (HPC) systems with thousands of nodes, such as IBM's Blue Gene/L. A shorter mean time to failure can be addressed by adding fault tolerance that reconfigures the working nodes so that communication and computation can progress. However, existing approaches fall short in providing scalability and small reconfiguration overhead within the fault-tolerant layer. This thesis presents a scalable approach to reconfigure the communication infrastructure after node failures. We propose a decentralized (peer-to-peer) protocol that maintains a consistent view of active nodes in the presence of faults. Our protocol shows reconfiguration response times on the order of hundreds of microseconds using MPI over BlueGene/L and single-digit milliseconds using TCP over Gigabit. The protocol can be adapted to match the network topology to further increase performance. We also verify experimental results against a performance model, which demonstrates the scalability of the approach. Hence, the membership service is suitable for deployment in the communication layer of MPI runtime systems.

Attachments
Files	Size	Format	View
Scalable, Fault-Tolerant Membership for Group Communication on HPC Systems	425KB	PDF	download


Scalable, Fault-Tolerant Membership for Group Communication on HPC Systems
Keywords: BlueGene/L; Parallel Computing; High Performance; Scalable; Fault Tolerant; Group Communication
Varma, Jyothish S ; Dr. Frank Mueller, Committee Chair ; Dr. Tao Xie, Committee Member ; Dr. Vincent Freeh, Committee Member
University: North Carolina State University
Others : https://repository.lib.ncsu.edu/bitstream/handle/1840.16/2089/etd.pdf?sequence=1&isAllowed=y
United States | English


