High-performance parallel machines with hundreds of thousands of processors and petascale performance are already in use, and even larger exaflop-scale computing systems, which may have hundreds of millions of cores, are planned. One of the biggest challenges in running parallel applications on machines of such massive scale is the parallel startup process. This task involves two components: (1) launching the appropriate processes in parallel on the given set of processors and (2) setting up communication channels so that the processes can communicate with each other once launching has completed. Most current startup mechanisms rely either on special-purpose daemons, which waste system resources, or on a startup manager, which becomes a scalability bottleneck. In this thesis, we investigate the design and scalability of an SMP-aware, multi-level startup scheme with batching of remote shell sessions, which provides a complete solution to the startup of a parallel application and facilitates its management during execution. The scheme still supports existing Charm++ runtime capabilities, including process health monitoring, facilitation of recovery from failures, and scalable interaction with the application. We demonstrate the performance and scalability of this scheme by applying it to the startup of Charm++ applications. In particular, starting up a Charm++ program on 16,384 cores of Ranger (at TACC) with Ethernet as the underlying communication layer takes only 25 seconds and attains a speedup of over 400% compared to MPICH2-1.3 startup (using Hydra as the process manager) and over 800% compared to Open MPI 1.3.1 startup on Ranger.
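As a rough illustration of the idea named above, the following Python sketch plans a multi-level launch tree over a list of hosts and splits each launcher's children into batches of remote shell sessions. It is not the thesis implementation: the names `plan_launch_tree`, `chunk`, `fanout`, and `batch_size` are hypothetical, and the sketch assumes that "SMP-aware" means one remote session per host (with the per-core processes started locally on that host) rather than one session per core.

```python
from typing import Dict, List


def chunk(items: List, size: int) -> List[List]:
    """Split items into consecutive groups of at most `size` elements."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def plan_launch_tree(hosts: List[str], fanout: int) -> Dict[str, List[str]]:
    """Map each launcher to the hosts it starts directly.

    Hypothetical sketch: the first host of every group becomes a
    sub-launcher responsible for the rest of its group, so no single
    launcher has to open more than `fanout` remote sessions.
    """
    if fanout < 1:
        raise ValueError("fanout must be at least 1")
    tree: Dict[str, List[str]] = {}

    def assign(parent: str, subtree: List[str]) -> None:
        if not subtree:
            tree[parent] = []  # leaf: nothing further to launch
            return
        group_size = -(-len(subtree) // fanout)  # ceiling division
        children = []
        for group in chunk(subtree, group_size):
            head, rest = group[0], group[1:]
            children.append(head)
            assign(head, rest)
        tree[parent] = children

    assign("root", hosts)
    return tree


if __name__ == "__main__":
    hosts = [f"n{i}" for i in range(8)]
    tree = plan_launch_tree(hosts, fanout=2)
    batch_size = 1  # assumed limit on sessions opened at once
    for launcher, children in tree.items():
        # Each launcher opens its remote sessions one batch at a time.
        print(launcher, chunk(children, batch_size))
```

In this sketch the number of remote sessions equals the number of hosts, not the number of cores, and the depth of the tree grows roughly logarithmically with the host count for a fixed `fanout`.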
A multi-level scalable startup for parallel applications