Technical Report

【Abstract】

Large-scale distributed systems assembled from commodity parts, such as CPlant, have become common tools in distributed computing. Because of their size and the diversity of their parts, these systems are prone to failures. Applications that run on these systems are not equipped to deal with failures efficiently, nor is there vendor support for fault tolerance. Thus, when a failure occurs, the application crashes. While most programmers use checkpoints so that their applications can be restarted, this approach is cumbersome and incurs substantial overhead. In many cases, there are more efficient and more elegant ways to address failures. The goal of this project is to develop a software architecture for detecting and recovering from faults in a cluster computing environment. The detection phase relies on the latest techniques developed in the fault tolerance community. Recovery is addressed in an application-dependent manner, allowing the programmer to take advantage of algorithmic characteristics to reduce the overhead of fault tolerance. This architecture will allow large-scale applications to be more robust in high-performance computing environments composed of clusters of commodity computers, such as CPlant and SMP clusters.

【Preview】

Attachments
Files	Size	Format	View
754901.pdf	698KB	PDF	download


Algorithm-dependent fault tolerance for distributed computing

Hough, P. D.; Goldsby, M. E.; Walsh, E. J.
Sandia National Laboratories
Keywords: Fault Tolerant Computers; Computer Architecture; Algorithms; Distributed Data Processing; 99 General And Miscellaneous//Mathematics, Computing, And Information Science
DOI: 10.2172/754901; RP-ID: SAND2000-8219; RP-ID: AC04-94AL85000; RP-ID: 754901
United States | English
Source: UNT Digital Library


