Technical Report Details
Achieving Scalable Automated Diagnosis of Distributed Systems Performance Problems
Huang, Chengdu ; Cohen, Ira ; Symons, Julie ; Abdelzaher, Tarek
HP Development Company
Keywords: system performance diagnosis; machine learning; transfer learning; scalability
RP-ID: HPL-2006-160R1
Subject classification: Computer Science (General)
United States | English
Source: HP Labs
【 Abstract 】

Distributed systems continue to grow in scale and complexity, resulting in increasingly involved interactions among components and increasingly intricate failure modes that are very hard to diagnose manually. This increased vulnerability of larger systems, together with the increased difficulty of failure diagnosis, has motivated machine learning approaches to automate the diagnosis task. Although preliminary results are encouraging, scaling up the existing approaches to large applications remains challenging. As scale increases, current approaches suffer from the curse of dimensionality, exacerbated by the exploding set of system states and measured metrics. In this paper, we significantly improve the scalability of performance diagnosis methods. Our contributions lie in the use of (i) an intelligent partitioning of the metric space, coupled with a cooperative temporal segmentation algorithm, dividing system observations in time and in space to remove the multiplicative explosion of system states, and (ii) transfer learning techniques that improve accuracy by leveraging dependencies among the partitions. We validate our approaches on several months of production traces from a customer-facing, geographically distributed, 24x7, 3-tier internet service. Our results show a significant accuracy improvement (35% on average) over the naive partitioning of the state space (without the new temporal segmentation algorithm or transfer learning), and an order of magnitude reduction in computational cost over the "brute force" approach of learning with no partitioning, without loss of accuracy.

14 pages
