Dissertation Details
Monitoring and analysis system for performance troubleshooting in data centers
Wang, Chengwei ; Schwan, Karsten ; Liu, Ling ; Blough, Douglas ; Wolf, Matthew ; Talwar, Vanish ; Mansour, Mohamed
University: Georgia Institute of Technology
Department: Computer Science
Keywords: Monitoring; Analysis; Performance; Troubleshooting; Data center
Others  :  https://smartech.gatech.edu/bitstream/1853/50411/1/WANG-DISSERTATION-2013.pdf
United States | English
Source: SMARTech Repository
【 Abstract 】

It was not long ago. On Christmas Eve 2012, a war of troubleshooting began in Amazon data centers. It started at 12:24 PM with a mistaken deletion of the state data of the Amazon Elastic Load Balancing service (ELB for short), which was not recognized at the time. The mistake first caused a local issue in which a small number of ELB service APIs were affected. In about six minutes, it evolved into a critical issue that significantly affected EC2 customers. One example was Netflix, which was using hundreds of Amazon ELB services and experienced an extensive streaming outage during which many customers could not watch TV shows or movies on Christmas Eve. It took Amazon engineers 5 hours and 42 minutes to find the root cause, the mistaken deletion, and another 15 hours and 32 minutes to fully recover the ELB service. The war ended at 8:15 AM the next day and brought performance troubleshooting in data centers to the world’s attention.

As this Amazon ELB case shows, troubleshooting runtime performance issues is crucial in time-sensitive multi-tier cloud services because of their stringent end-to-end timing requirements, but it is also notoriously difficult and time consuming. To address the troubleshooting challenge, this dissertation proposes VScope, a flexible monitoring and analysis system for online troubleshooting in data centers. VScope provides primitive operations that data center operators can use to troubleshoot various performance issues. Each operation is essentially a series of monitoring and analysis functions executed on an overlay network. We design a novel software architecture for VScope so that the overlay networks can be generated, executed, and terminated automatically and on demand. From the troubleshooting side, we design novel anomaly detection algorithms and implement them in VScope. By running anomaly detection algorithms in VScope, data center operators are notified when performance anomalies happen. We also design a graph-based guidance approach, called VFocus, which tracks the interactions among hardware and software components in data centers. VFocus provides primitive operations by which operators can analyze these interactions to find out which components are relevant to the performance issue.

VScope’s capabilities and performance are evaluated on a testbed with over 1000 virtual machines (VMs). Experimental results show that the VScope runtime negligibly perturbs system and application performance, and requires mere seconds to deploy monitoring and analytics functions on over 1000 nodes. This demonstrates VScope’s ability to support fast operation and online queries against a comprehensive set of metrics, from the application level to the system/platform level, and a variety of representative analytics functions. When supporting algorithms with high computational complexity, VScope serves as a ‘thin layer’ that occupies no more than 5% of their total latency. Further, by using VFocus, VScope can locate problematic VMs that cannot be found via application-level monitoring alone, and in one of the use cases explored in the dissertation, it operates with perturbation levels over 400% lower than those seen for brute-force and most sampling-based approaches. We also validate VFocus with real-world data center traces. The experimental results show that VFocus has a troubleshooting accuracy of 83% on average.

【 Preview 】
Attachment List
Files Size Format View
Monitoring and analysis system for performance troubleshooting in data centers 5112KB PDF download
  Document Metrics  
  Downloads: 27    Views: 17