Conference paper details
21st International Conference on Computing in High Energy and Nuclear Physics
Integration of PanDA workload management system with Titan supercomputer at OLCF
Physics; Computer science
De, K.^1 ; Klimentov, A.^2 ; Oleynik, D.^1 ; Panitkin, S.^2 ; Petrosyan, A.^1 ; Schovancova, J.^2 ; Vaniachine, A.^3 ; Wenaus, T.^2
Department of Physics, University of Texas at Arlington, Arlington, TX 76019, United States^1
Brookhaven National Lab, Upton, NY 10573, United States^2
Argonne National Lab, 9700 S. Cass Avenue, Lemont, IL 60439, United States^3
Keywords: ATLAS experiment; Computing facilities; Current computing; Distributed analysis; Precise definition; U.S. Department of Energy; Utilization efficiency; Workload management
Others  :  https://iopscience.iop.org/article/10.1088/1742-6596/664/9/092020/pdf
DOI  :  10.1088/1742-6596/664/9/092020
Subject classification: Computer science (general)
Source: IOP
【 Abstract 】

The PanDA (Production and Distributed Analysis) workload management system (WMS) was developed to meet the scale and complexity of LHC distributed computing for the ATLAS experiment. While PanDA currently distributes jobs to more than 100,000 cores at well over 100 Grid sites, future LHC data-taking runs will require more resources than Grid computing can possibly provide. To address this challenge, ATLAS is engaged in an ambitious program to expand the current computing model to include additional resources, such as the opportunistic use of supercomputers. We describe a project aimed at integrating the PanDA WMS with the Titan supercomputer at the Oak Ridge Leadership Computing Facility (OLCF). The current approach uses a modified PanDA pilot framework for job submission to Titan's batch queues and for local data management, with lightweight MPI wrappers to run single-threaded workloads in parallel on Titan's multicore worker nodes. It also gives PanDA a new capability to collect, in real time, information about unused worker nodes on Titan, which allows the size and duration of jobs submitted to Titan to be defined precisely according to the free resources available. This capability significantly reduces PanDA job wait time while improving Titan's utilization efficiency. The implementation was tested with a variety of Monte Carlo workloads on Titan and is being tested on several other supercomputing platforms.

Notice: This manuscript has been authored by employees of Brookhaven Science Associates, LLC under Contract No. DE-AC02-98CH10886 with the U.S. Department of Energy. The publisher, by accepting the manuscript for publication, acknowledges that the United States Government retains a non-exclusive, paid-up, irrevocable, world-wide license to publish or reproduce the published form of this manuscript, or allow others to do so, for United States Government purposes.
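The abstract's idea of sizing a job to fit currently unused worker nodes can be illustrated with a minimal Python sketch. This is not the PanDA pilot's actual code: the names `BackfillSlot`, `JobRequest` and `size_job`, the node cap, the minimum walltime and the safety margin are all hypothetical choices made for illustration. The figure of 16 CPU cores per Titan node is an assumption that does not come from the abstract.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption (not stated in the abstract): 16 CPU cores per Titan worker node.
CORES_PER_NODE = 16


@dataclass(frozen=True)
class BackfillSlot:
    """Hypothetical snapshot of idle resources reported by the batch system."""
    free_nodes: int          # nodes idle right now
    minutes_available: int   # how long they are expected to stay idle


@dataclass(frozen=True)
class JobRequest:
    """Hypothetical batch job shaped to fit a backfill slot."""
    nodes: int
    walltime_minutes: int
    ranks: int  # one single-threaded payload per core, launched via an MPI wrapper


def size_job(
    slot: BackfillSlot,
    min_nodes: int = 1,
    max_nodes: int = 300,      # illustrative cap on a single submission
    min_walltime: int = 30,    # skip slots too short to finish useful work
    safety_margin: int = 5,    # minutes kept back so the job ends inside the slot
) -> Optional[JobRequest]:
    """Return a job that fits inside the slot, or None if the slot is too small."""
    walltime = slot.minutes_available - safety_margin
    if slot.free_nodes < min_nodes or walltime < min_walltime:
        return None
    nodes = min(slot.free_nodes, max_nodes)
    return JobRequest(
        nodes=nodes,
        walltime_minutes=walltime,
        ranks=nodes * CORES_PER_NODE,
    )


if __name__ == "__main__":
    # 500 idle nodes for 125 minutes: the request is capped at max_nodes.
    print(size_job(BackfillSlot(free_nodes=500, minutes_available=125)))
```

Because the request never exceeds the idle nodes or the idle time window, a job shaped this way could in principle start without queueing behind larger jobs, which is the mechanism the abstract credits for shorter wait times and better utilization.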

【 Preview 】
Attachment list
Files Size Format View
Integration of PanDA workload management system with Titan supercomputer at OLCF 769KB PDF download
  Document metrics
  Downloads: 14    Views: 63