Conference paper details
20th International Conference on Computing in High Energy and Nuclear Physics
Automating usability of ATLAS Distributed Computing resources
Physics; Computer Science
Tupputi, S.A.^1 ; Di Girolamo, A.^2 ; Kouba, T.^3 ; Schovancová, J.^4
INFN-CNAF, viale B. Pichat, 6/2, Bologna 40127, Italy^1
CERN, Geneva 23, CH-1211, Switzerland^2
Institute of Physics, Academy of Sciences of the Czech Republic, Na Slovance 2, Prague 8, CZ-18221, Czech Republic^3
Brookhaven National Laboratory, Upton, NY, United States^4
Keywords: Automatic handling; Decision criterions; Distributed computing resources; Human interactions; Inference algorithm; Performance enhancing; Suitable solutions; Time granularities
Others  :  https://iopscience.iop.org/article/10.1088/1742-6596/513/3/032098/pdf
DOI  :  10.1088/1742-6596/513/3/032098
Subject classification: Computer Science (general)
Source: IOP
【 Abstract 】
The automation of ATLAS Distributed Computing (ADC) operations is essential to reduce manpower costs and to enable performance-enhancing actions that improve the reliability of the system. A crucial case is the automatic handling of outages of the storage resources at ATLAS computing sites, which are continuously exploited at the edge of their capabilities. Adopting unambiguous decision criteria is challenging for storage resources of non-homogeneous types, sizes and roles. The recently developed Storage Area Automatic Blacklisting (SAAB) tool provides a suitable solution by employing an inference algorithm that processes the history of storage monitoring test outcomes. SAAB both provides global monitoring and performs automatic operations on single sites. The implementation of the SAAB tool has been the first step in a comprehensive review of storage area monitoring and central management at all levels. This review has involved reordering and optimizing the deployment of SAM tests and including SAAB results in the ATLAS Site Status Board with dedicated metrics and views. The resulting structure allows the status of storage resources to be monitored with fine time granularity and automatic actions to be taken in foreseen cases, such as automatic outage handling and notifications to sites. Human actions are thus restricted to reporting and following up on problems where and when needed. In this work we show the working principles and features of SAAB. We also present the decrease in human interactions achieved within the ATLAS Computing Operation team. The automation results in a prompt reaction to failures, which leads to optimized resource exploitation.
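The abstract does not spell out the inference rule SAAB applies, so the following is only a minimal sketch of how a blacklisting decision could be derived from a history of storage test outcomes. It assumes a sliding time window and a failure-fraction threshold; the names `ProbeResult` and `decide_action`, and every parameter value, are hypothetical and are not taken from SAAB.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ProbeResult:
    """One storage test outcome (hypothetical record layout)."""
    timestamp: int  # minutes since an arbitrary reference time
    ok: bool        # True if the test succeeded


def decide_action(
    history: List[ProbeResult],
    now: int,
    blacklisted: bool = False,
    window: int = 60,             # assumed look-back period, in minutes
    fail_threshold: float = 0.75, # assumed failure fraction that triggers blacklisting
    min_tests: int = 4,           # assumed minimum evidence before acting
) -> str:
    """Return "blacklist", "whitelist" or "none" for one storage area."""
    # Keep only the outcomes that fall inside the look-back window.
    recent = [r for r in history if now - window <= r.timestamp <= now]

    # Too few tests: take no automatic action.
    if len(recent) < min_tests:
        return "none"

    failures = sum(1 for r in recent if not r.ok)
    fail_fraction = failures / len(recent)

    # Exclude a storage area that is failing consistently.
    if not blacklisted and fail_fraction >= fail_threshold:
        return "blacklist"

    # Restore an excluded storage area only after a fully clean window.
    if blacklisted and failures == 0:
        return "whitelist"

    return "none"


if __name__ == "__main__":
    failing = [ProbeResult(t, False) for t in (10, 20, 30, 40)]
    print(decide_action(failing, now=45))
```

Requiring a fully clean window before restoring a storage area, while excluding it at a lower failure fraction, is one simple way to avoid flapping between the two states; whether SAAB uses such asymmetry is not stated in the abstract.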