Distributed graph processing systems are an emerging area of big data systems. As graphs continue to grow in size and prevalence, these systems must become faster and more scalable. However, after failures, distributed graph processing systems either largely rely on proactive fault tolerance techniques such as checkpointing, or use no fault tolerance mechanisms at all and simply restart computation. The former approach entails significant proactive overheads that increase with the size of the graph, while the latter wastes time and resources in potentially lengthy recomputation. In this thesis, we argue that distributed graph processing systems should instead use a approximate approach to failure recovery that trades off minimal amounts of application accuracy while reducing the overhead during failure-free execution to zero, and allowing fast and scalable recovery. We build a system called Zorro that imbues the approximate reactive approach, and integrate Zorro into two distributed graph processing systems -- PowerGraph and LFGraph. When a failure occurs, Zorro opportunistically exploits vertex replication (inherent in today's graph processing systems) to quickly and scalably rebuild the state of failed servers. In addition, we describe three other novel failure recovery mechanisms that aim to address several of Zorro's shortcomings. The first utilizes optimistic accuracy results from graph sampling and hence continues after failure without taking any action. The second repartitions the graph after failure to avoid waiting for replacement servers, and then continues computation with the recovered state. The last allows a small amount of proactive overhead to significantly increase the fraction of recovered state.Experiments using five real-world graphs and eight benchmark applications demonstrate that Zorro is able to recover over 99% of the graph state when a few servers fail, and between 87-92% when half the cluster fails, with recovery taking only a fraction of the cost of a single iteration. Furthermore, using eight common graph processing algorithms, Zorro incurs little to no accuracy loss in all experimental failure scenarios. Furthermore, preliminary analysis and experiments using our three alternative approaches suggest that they are able to address many of the potential issues Zorro faces with minimal overhead and accuracy loss.
【 预 览 】
附件列表
Files
Size
Format
View
Approximate failure recovery in distributed graph processing systems