Certification of Completion of ASC FY08 Level-2 Milestone ID #2933
Lipari, D A
Keywords: ARCHITECTURE; DESIGN; GEOMETRY; LAUNCHING; LAWRENCE LIVERMORE NATIONAL LABORATORY; PRODUCTION; RESOURCE MANAGEMENT; SCHEDULES; STATISTICS; TOPOLOGY
DOI: 10.2172/945523; RP-ID: LLNL-TR-404701; PID: OSTI ID: 945523; Others: TRN: US200904%%74
Subject classification: Social Sciences, Humanities and Arts (General)
United States | English
Source: SciTech Connect
[Abstract]
This report documents the satisfaction of the completion criteria associated with ASC FY08 Milestone ID No. 2933: 'Deploy Moab resource management services on BlueGene/L'. Specifically, this milestone represents LLNL efforts to enhance both SLURM and Moab to extend Moab's capabilities to schedule and manage BlueGene/L, and to increase the portability of user scripts between ASC systems. The completion criteria for the milestone are the following: (1) Batch jobs can be specified, submitted to Moab, scheduled and run on the BlueGene/L system; (2) Moab will be able to support the markedly increased scale in node count as well as the wiring geometry that is unique to BlueGene/L; and (3) Moab will also prepare and report statistics of job CPU usage just as it does for the current systems it supports. This document presents the completion evidence for both of the stated milestone certification methods: (1) documentation, a report that certifies that the completion criteria have been met; and (2) user hand-off. As the selected Tri-Lab workload manager, Moab was chosen to replace LCRM as the enterprise-wide scheduler across Livermore Computing (LC) systems. While LCRM/SLURM successfully scheduled jobs on BG/L, the effort to replace LCRM with Moab on BG/L represented a significant challenge. Moab is a commercial product developed and sold by Cluster Resources, Inc. (CRI). Moab receives users' batch job requests and dispatches these jobs to run on a specific cluster. SLURM is an open-source resource manager whose development is managed by members of the Integrated Computational Resource Management Group (ICRMG) within the Services and Development Division at LLNL. SLURM is responsible for launching and running jobs on an individual cluster. Replacing LCRM with Moab on BG/L required substantial changes to both Moab and SLURM.
While the ICRMG could directly manage the SLURM development effort, the work to enhance Moab had to be done by Moab's vendor. Members of the ICRMG held many meetings with CRI developers to develop the design and specify the requirements for what Moab needed to do. Extensions to SLURM are used to run jobs on the BlueGene/L architecture. These extensions support the three-dimensional network topology unique to BG/L. While BG/L geometry support was already in SLURM, enhancements were needed to provide backfill capability and answer 'will-run' queries from Moab. For its part, the Moab architecture needed to be modified to interact with SLURM in a more coordinated way. It needed enhancements to support SLURM's shorthand notation for representing thousands of compute nodes and to report this information using Moab's existing status commands. The LCRM wrapper scripts that emulated LCRM commands also needed to be enhanced to support BG/L usage. The effort was successful: Moab 5.2.2 and SLURM 1.3 were installed on the 106,496-node BG/L machine on May 21, 2008, and turned over to the users for production use.