Thesis Details
Providing application-aware reliability through OS/hypervisor-level techniques
Wang, Long
Keywords: checkpoint; reliability; virtual machine; hypervisor; system hang; microkernel; operating system; error detection; error injection
Others  :  https://www.ideals.illinois.edu/bitstream/handle/2142/18440/Wang_Long.pdf?sequence=1&isAllowed=y
United States | English
Source: The Illinois Digital Environment for Access to Learning and Scholarship
【 Abstract 】

Operating systems and hypervisors enable the collection and extraction of rich information on application and system execution characteristics. This thesis describes a Reliability MicroKernel (RMK) architecture, an infrastructure for designing and deploying software modules that provide application-aware error detection and recovery. The purpose of the RMK is to provide an automatic approach for low-latency crash/hang detection and rapid recovery via checkpointing. We first demonstrate how the RMK works in a native system and then enhance the RMK to work in VMs. In a native system, the RMK is installed as a device driver, while in a virtualized system, the RMK is both installed as a device driver in VMs and deployed as a hypercall (which is like a system call) in a hypervisor. Our approach is transparent to applications and VMs, i.e., the kernel source code of a native system or of a VM does not need to be modified or recompiled. The implemented RMK modules include OS/application crash detection, system hang detection, and transparent checkpointing. Traditionally, an external hardware watchdog is used to force a system reboot whenever the watchdog is not reset within a predefined timeout interval. The detection latency might be significant because the timeout interval for resetting the watchdog timer is usually a matter of seconds to reduce false alarms. The approach in this thesis enables low-latency OS-hang detection (within hundreds of milliseconds or less) by measuring the count of instructions executed between two consecutive context switches and checking if the count exceeds a predefined threshold value. The RMK is enhanced to support virtualized environments. Specifically, we present the description, implementation, and experimental assessment of VM-μCheckpoint, a VM checkpointing framework to protect both the guest OS and applications against runtime errors.
Compared with the existing VM checkpoint techniques, our VM-μCheckpoint has small overhead and rapid recovery, handles non-fail-stop errors, and runs at high frequency (tens of checkpoints per second) to reduce the recomputation necessary when recovering a VM from a failure. The key point of VM-μCheckpoint is that we perform incremental checkpointing by considering the whole memory of the protected VM as part of the checkpoint. The RMK prototype has been implemented in both Linux and Windows systems on a Pentium 4 processor and is also implemented in the Xen VMM. (The Xen hypervisor is recompiled for installing RMK, but the OS of a native system or a VM is not recompiled.) Error injection experiments show that our RMK detects all the crashes and system hangs, and VM-μCheckpoint successfully recovers VMs from all the crashes. Moreover, the experimental evaluation of the RMK using real-world applications shows that we achieve high coverage and low false-positive rates for error detection (e.g., no false positives for system hang detection) as well as low overhead in providing checkpoint and recovery (e.g., an average of 6.3% overhead in VM-μCheckpoint for SPEC benchmark programs with 50 ms checkpoint intervals). We also apply a formal method and analytical/probabilistic models to verify the capability of our system hang detection and to study the availability enhancement provided by the RMK.
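
The hang-detection rule stated in the abstract (count the instructions executed since the last context switch and flag a hang when that count exceeds a predefined threshold) can be illustrated with a minimal Python sketch. It is not the RMK implementation: the class name `HangDetector`, its methods, and the threshold value are hypothetical, and the instruction count is assumed to arrive as a monotonically increasing integer supplied by the caller.

```python
from dataclasses import dataclass


@dataclass
class HangDetector:
    """Illustrative only: flags a hang when too many instructions
    execute without an intervening context switch."""

    threshold: int              # hypothetical limit on instructions between switches
    last_switch_count: int = 0  # instruction count recorded at the last context switch

    def on_context_switch(self, instr_count: int) -> None:
        # A context switch shows the scheduler is still making progress,
        # so the measurement window restarts here.
        self.last_switch_count = instr_count

    def is_hung(self, instr_count: int) -> bool:
        # Instructions executed since the last observed context switch.
        executed = instr_count - self.last_switch_count
        return executed > self.threshold


# Usage with made-up numbers.
detector = HangDetector(threshold=1_000_000)
detector.on_context_switch(5_000)
within_limit = detector.is_hung(500_000)      # 495,000 instructions since the switch
over_limit = detector.is_hung(1_200_000)      # 1,195,000 instructions since the switch
```

Because the check is driven by executed instructions and not by wall-clock time, the threshold can be set independently of the multi-second timeouts the abstract attributes to hardware watchdogs; how the thesis chooses the threshold is not stated in this record.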

【 Preview 】
Attachments
Files Size Format View
Providing application-aware reliability through OS/hypervisor-level techniques 1606KB PDF download