Modern storage systems continue to increase in scale and complexity as they attempt to meet the growing storage needs of our society. Additionally, increased requirements to comply with government regulation and consumer expectations have heightened the need to make data more available and reliable for longer periods of time. The design of modern and next-generation storage systems is a difficult task that requires high storage capacity and efficiency while also maintaining data integrity. The rapid advancement of storage system technologies brings with it a level of uncertainty as to the fitness of new designs and methods for meeting these complex requirements. New technologies, like deduplication, promise improved storage efficiency, but their impact on reliability measures is unclear due to the complex relationships inherent to the systems that employ these technologies. Additionally, as systems scale up, they become subject to faults and errors that previous-generation systems may never have encountered due to the rare nature of these faults. Because of the stiffness of the represented systems and the complex relationships involved, it can be difficult to analyze these environments correctly and efficiently.

In this dissertation, we propose a method to analyze storage system reliability by using component-based models coupled with realistic fault models. We solve these complex systems by identifying fault, fault propagation, and mitigation events; by identifying dependence relationships between state variables, events, and rewards; and by decomposing our model at various points during model solution to improve the efficiency of our solution while maintaining the correctness of our reward measures.

In particular, we discuss building scalable component-based models of large-scale systems that employ modern reliability methods, such as RAID, and state-of-the-art storage efficiency methods, such as deduplication. We present detailed fault models for these systems, including a novel model for undetected disk errors. To enable efficient solution of these models, we propose a method to analyze the dependence relationships that underlie storage systems and a way to solve these models by identifying and exploiting these relationships when solving for reliability measures. We apply our methods to real-world systems, detail the consequences for the reliability of deduplication, and suggest and evaluate methods to improve reliability while still maintaining improved storage efficiency.
Understanding the fault-tolerance properties of large-scale storage systems