Byzantine faults in distributed systems can have highly destructive consequences for the services built on top of them, yet production systems rarely tolerate such faults because existing approaches, such as Byzantine fault tolerance (BFT), impose substantial overhead and limit scalability. This work describes a reactive protocol for recovering from Byzantine failures in replicated state machines. In contrast to traditional BFT, which attempts to mask faults, this protocol allows faults to be exposed to clients but rolls back faulty updates once they are detected, ensuring that no client can fork the state of the system. As a result, the system always converges to a consistent state in spite of Byzantine failures. The system offers clients a contract called lapse consistency, which bounds the number of inconsistent reads a client can experience as a result of these rollbacks. This work extends prior work on Byzantine fault detection into an integrated system that not only eventually detects Byzantine faults but also responds to them with provable consistency semantics, while preserving many important properties of Byzantine detection, such as scalability and responsiveness. We evaluate the overhead of a proof-of-concept implementation of the system.
Eventual fault recovery strategies for Byzantine failures