Conference paper details
21st International Conference on Computing in High Energy and Nuclear Physics
A new Self-Adaptive disPatching System for local clusters
Physics; Computer Science
Kan, Bowen^1 ; Shi, Jingyan^1 ; Lei, Xiaofeng^1,2
Institute of High Energy Physics, Beijing, China^1
Graduate University of Chinese Academy of Sciences, Beijing, China^2
Keywords: Batch monitoring; Chinese Academy of Sciences; Dispatching systems; High performance cluster; Monitoring functions; Reliability and stability; Resource utilizations; Scheduling module
Others  :  https://iopscience.iop.org/article/10.1088/1742-6596/664/9/092015/pdf
DOI  :  10.1088/1742-6596/664/9/092015
Subject classification: Computer Science (General)
Source: IOP
【 Abstract 】

The scheduler is one of the most important components of a high-performance cluster. This paper introduces a self-adaptive dispatching system (SAPS) based on Torque [1] and Maui [2]. It raises cluster resource utilization, improves the overall speed of tasks, and provides extra functions for administrators and users. First, to allow the scheduling of GPUs, a GPU scheduling module based on Torque and Maui has been developed. Second, SAPS analyses the relationship between the number of queueing jobs and the number of idle job slots, and then tunes the priority of users' jobs dynamically, so that more jobs run and fewer job slots sit idle. Third, by integrating with the monitoring function, SAPS excludes nodes that the monitor detects in error states and returns them to the cluster once they have recovered. In addition, SAPS provides a series of function modules, including a batch monitoring management module, a comprehensive scheduling accounting module and a real-time alarm module. The aim of SAPS is to enhance the reliability and stability of Torque and Maui. SAPS is currently running stably on a local cluster at IHEP (Institute of High Energy Physics, Chinese Academy of Sciences), with more than 12,000 CPU cores and 50,000 jobs running each day. Monitoring has shown that resource utilization has improved by more than 26%, and that the management work for both administrators and users has been greatly reduced.
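
The abstract does not give the actual rules SAPS uses, so the following is only a minimal illustrative sketch of the two ideas it describes: dropping nodes that a monitor reports as faulty, and raising job priorities when jobs are queued while slots are idle. The function names, the `boost` parameter and the proportional rule are assumptions made for illustration. They are not taken from SAPS, Torque or Maui.

```python
def schedulable_nodes(nodes, error_nodes):
    """Return the nodes that may receive jobs.

    Hypothetical helper: `error_nodes` stands in for whatever the
    monitoring system reports; recovered nodes simply stop appearing in it.
    """
    return [n for n in nodes if n not in error_nodes]


def tune_priorities(queued, idle_slots, base_priority, boost=100):
    """Return adjusted priorities per user.

    Assumed rule (not the SAPS algorithm): if slots are idle while jobs
    are waiting, each user gets a share of `boost` proportional to their
    share of the queued jobs. Otherwise priorities are left unchanged.
    """
    total_queued = sum(queued.values())
    if idle_slots <= 0 or total_queued == 0:
        return dict(base_priority)

    tuned = {}
    for user, priority in base_priority.items():
        share = queued.get(user, 0) / total_queued
        tuned[user] = priority + round(boost * share)
    return tuned


if __name__ == "__main__":
    nodes = schedulable_nodes(["n1", "n2", "n3"], error_nodes={"n2"})
    priorities = tune_priorities(
        queued={"alice": 30, "bob": 10},
        idle_slots=8,
        base_priority={"alice": 500, "bob": 500, "carol": 500},
    )
    print(nodes)
    print(priorities)
```

In a real deployment the inputs would come from the batch system and the monitor rather than from literals, and the adjusted values would have to be written back through the scheduler's own configuration mechanism, which this sketch does not attempt.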
