As device density grows, each transistor gets smaller and more fragile, leading to an overall higher susceptibility to hard-faults. These hard-faults result in permanent silicon defects and impact the manufacturing yield, performance, and lifetime of semiconductor devices. In this thesis, we propose comprehensive, low-cost solutions to tackle reliability problems in high-performance microprocessors. These microprocessors mainly consist of on-chip caches and the core pipeline. We first present two flexible cache architectures, ZerehCache and Archipelago, to protect regular SRAM structures against high failure rates. ZerehCache virtually reorganizes the cache data array using a permutation network to provide higher degrees of freedom for spare allocation. To study the impact of fault patterns on the redundancy requirements of a cache, we propose a methodology that models the collision patterns in caches as a graph problem. Given this model, a graph coloring scheme is employed to minimize the amount of additional redundancy required for protecting the cache. Archipelago targets failures in the near-threshold region. It resizes the cache to provide redundancy for repairing faulty cells. Furthermore, a near-optimal minimum clique covering configuration algorithm is introduced to minimize the cache capacity loss. With proper solutions in place for caches, a robust and heterogeneous core coupling execution scheme, Necromancer, is presented to protect the general core area against hard-faults. Although a faulty core cannot be trusted, we observe that for most defects, execution traces on a defective core coarsely resemble those of fault-free executions. Necromancer exploits a functionally dead core to improve system throughput by supplying hints regarding high-level program behavior. We partition the cores into multiple groups. Each group shares a lightweight core that can be substantially accelerated.
However, due to the presence of defects, a perfect data or instruction stream cannot be provided by the dead core. This necessitates employing a low-cost recovery mechanism and generic hints that are more resilient to local abnormalities.
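The graph formulation of cache collisions mentioned above can be illustrated with a small sketch. It is not the thesis's algorithm: it assumes a hypothetical fault map (line id to the set of faulty word offsets), assumes that two lines "collide" when they are faulty at the same offset and therefore cannot share one spare line, and uses a simple greedy heuristic in place of whatever coloring scheme the thesis employs. Under those assumptions, each color stands for one spare line, so the number of colors gives an upper bound on the redundancy needed.

```python
from itertools import combinations


def build_collision_graph(fault_map):
    """Build an undirected graph over faulty cache lines.

    fault_map is a hypothetical format: line id -> set of faulty word offsets.
    Assumption: two lines collide if they share at least one faulty offset.
    """
    graph = {line: set() for line in fault_map}
    for a, b in combinations(fault_map, 2):
        if fault_map[a] & fault_map[b]:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_coloring(graph):
    """Greedy coloring, highest-degree node first (a heuristic, not optimal)."""
    colors = {}
    for node in sorted(graph, key=lambda n: (-len(graph[n]), n)):
        used = {colors[nb] for nb in graph[node] if nb in colors}
        color = 0
        while color in used:
            color += 1
        colors[node] = color
    return colors


# Illustrative fault map, invented for this example.
fault_map = {0: {1}, 1: {1, 3}, 2: {3}, 3: {0}}
graph = build_collision_graph(fault_map)
colors = greedy_coloring(graph)

# Lines with the same color never collide, so they can share one spare line.
spares_needed = max(colors.values()) + 1 if colors else 0
print(colors, spares_needed)
```

In this toy map, line 1 collides with lines 0 and 2, while line 3 collides with nothing, so the greedy pass groups lines 1 and 3 on one spare and lines 0 and 2 on another.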
Overcoming Hard-Faults in High-Performance Microprocessors.