Conference Paper Details
20th International Conference on Computing in High Energy and Nuclear Physics
Phronesis, a diagnosis and recovery tool for system administrators
Physics; Computer Science
Haen, C.^1 ; Barra, V.^2 ; Bonaccorsi, E.^3 ; Neufeld, N.^3
University Blaise Pascal, Clermont-Ferrand cedex 63006, France^1
LIMOS, UMR 6158 CNRS, University Blaise Pascal, Clermont-Ferrand cedex 63006, France^2
European Organization for Nuclear Research, CERN, Geneve 23, CH-1211, Switzerland^3
Keywords: Autonomic Computing; Heterogeneous computing; Manipulation tools; Shared experiences; System administrators
Others  :  https://iopscience.iop.org/article/10.1088/1742-6596/513/6/062021/pdf
DOI  :  10.1088/1742-6596/513/6/062021
Subject classification: Computer Science (General)
Source: IOP
【 Abstract 】

The LHCb experiment relies on the Online system, which includes a very large and heterogeneous computing cluster. Ensuring the proper behavior of the different tasks running on the more than 2000 servers represents a huge workload for the small operator team and is a 24/7 task. At CHEP 2012, we presented a prototype of a framework that we designed in order to support the experts. The main objective is to provide them with steadily improving diagnosis and recovery solutions in case of misbehavior of a service, without having to modify the original applications. Our framework is based on adapted principles of the Autonomic Computing model, on Reinforcement Learning algorithms, as well as innovative concepts such as Shared Experience. While the submission at CHEP 2012 showed the validity of our prototype on simulations, we here present an implementation with improved algorithms and manipulation tools, and report on the experience gained with running it in the LHCb Online system.
