Thesis details
Risk-based proactive availability management - attaining high performance and resilience with dynamic self-management in Enterprise Distributed Systems
Keywords: High performance; Routing and scheduling; Overlay; Autonomic computing; High availability; Distributed systems
Cai, Zhongtang ; Computing
University: Georgia Institute of Technology
Department: Computing
Others: https://smartech.gatech.edu/bitstream/1853/22581/1/cai_zhongtang_200805_PhD.pdf
United States | English
Source: SMARTech Repository
【 Abstract 】

Complex distributed systems play an increasingly important role in industry and society. Examples include distributed information flow systems, which continuously acquire, manipulate, and disseminate information across an enterprise's distributed sites and machines, and distributed server applications co-deployed in one or more shared data centers, each with performance and availability requirements that vary over time and each competing with the others for shared resources. Consequently, it becomes more important for enterprise-scale IT infrastructure to provide timely, sustained, and reliable delivery and processing of service requests. Despite more than 30 years of progress in distributed computer connectivity, availability, and reliability, this has not become easier, and may have become more difficult [ReliableDistributedSys], for several reasons: the increasing complexity of enterprise-scale computing infrastructure; the distributed nature of these systems, which makes them prone to failures, e.g., because of inevitable Heisenbugs; the need to consider diverse and complex business objectives and policies, including risk preferences and attitudes; conflicts between performance and availability; the varying importance of the sub-systems in an enterprise's distributed infrastructure, which compete for resources in today's typical shared environments; and the best-effort nature of resources such as network resources, which makes resource availability itself an issue.
This thesis proposes a novel business policy-driven, risk-based automated availability management approach, which uses an automated decision engine to make availability decisions that meet business policies while optimizing overall system utility, uses utility theory to capture users' risk attitudes, and addresses the potentially conflicting business goals and resource demands in enterprise-scale distributed systems. For critical and complex enterprise applications, since a key contributor to application utility is the time taken to recover from failures, we develop a novel proactive fault tolerance approach, which uses online methods for failure prediction to dynamically determine the acceptable amounts of additional processing and communication resources to be used (i.e., costs) to attain certain levels of utility and acceptable delays in failure recovery. Since resource availability itself is often not guaranteed in typical shared enterprise IT environments, this thesis provides IQ-Paths with probabilistic service guarantees to address dynamic network behavior in realistic enterprise computing environments. The risk-based formulation is used as an effective way to link the operational guarantees expressed by utility and enforced by the PGOS algorithm with the higher-level business objectives sought by end users.
Together, this thesis proposes a novel availability management framework and methods for large-scale enterprise applications and systems, with the goal of providing different levels of performance/availability guarantees for multiple applications and sub-systems in a complex shared distributed computing infrastructure. More specifically, this thesis addresses the following problems. For data center environments: (1) how to provide availability management for applications and systems that vary both in resource requirements and in their importance to the enterprise, based both on operational-level quantities and on business-level objectives; (2) how to deal with managerial policies such as risk attitude; and (3) how to deal with the tradeoff between performance and availability, given limited resources in a typical data center. Since realistic business settings extend beyond single data centers, a second set of problems addressed in this thesis concerns predictable and reliable operation in wide-area settings. For such systems, we explore (4) how to provide high availability in widely distributed operational systems with low-cost fault tolerance mechanisms, and (5) how to provide probabilistic service guarantees given best-effort network resources.
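
As an illustration of how utility theory can encode a risk attitude in an availability decision, the following is a minimal sketch and not the thesis's decision engine. It assumes a textbook exponential (constant absolute risk aversion) utility function; the option names, probabilities, and payoffs are hypothetical values chosen only for the example.

```python
import math

def cara_utility(payoff, risk_aversion):
    """Exponential (CARA) utility. risk_aversion > 0 models a risk-averse
    manager, 0 a risk-neutral one. This functional form is an assumption."""
    if risk_aversion == 0:
        return payoff
    return (1.0 - math.exp(-risk_aversion * payoff)) / risk_aversion

def expected_utility(outcomes, risk_aversion):
    """outcomes is a list of (probability, payoff) pairs summing to 1."""
    return sum(p * cara_utility(x, risk_aversion) for p, x in outcomes)

def choose(options, risk_aversion):
    """Return the name of the option with the highest expected utility."""
    return max(options, key=lambda name: expected_utility(options[name], risk_aversion))

# Hypothetical availability options (payoffs in arbitrary business-value units).
options = {
    # No extra resources: high payoff normally, large loss if a failure occurs.
    "no_standby": [(0.95, 10.0), (0.05, -40.0)],
    # Standby replica: lower payoff because of its resource cost, but no large loss.
    "hot_standby": [(1.0, 7.0)],
}

neutral_choice = choose(options, risk_aversion=0.0)
averse_choice = choose(options, risk_aversion=0.1)
print(neutral_choice, averse_choice)
```

With these made-up numbers, a risk-neutral policy prefers the cheaper option because its expected payoff is slightly higher, while a risk-averse policy pays for the standby to avoid the rare large loss. The same operational data therefore leads to different availability decisions under different business policies.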

【 Preview 】
Attachments
Files | Size | Format | View
Risk-based proactive availability management - attaining high performance and resilience with dynamic self-management in Enterprise Distributed Systems | 2117KB | PDF | download
Document metrics
Downloads: 7; Views: 12