Technical Report Details
Algorithm-dependent fault tolerance for distributed computing
Hough, P. D. ; Goldsby, M. E. ; Walsh, E. J.
Sandia National Laboratories
Keywords: Fault Tolerant Computers; Computer Architecture; Algorithms; Distributed Data Processing; 99 General And Miscellaneous//Mathematics, Computing, And Information Science
DOI  :  10.2172/754901
RP-ID  :  SAND2000-8219
RP-ID  :  AC04-94AL85000
RP-ID  :  754901
United States | English
Source: UNT Digital Library
【 Abstract 】

Large-scale distributed systems assembled from commodity parts, such as CPlant, have become common tools in distributed computing. Because of their size and the diversity of their parts, these systems are prone to failures. Applications run on these systems have not been equipped to deal with failures efficiently, nor is there vendor support for fault tolerance. Thus, when a failure occurs, the application crashes. While most programmers use checkpoints so that their applications can be restarted, this approach is cumbersome and incurs substantial overhead. In many cases, there are more efficient and more elegant ways to address failures. The goal of this project is to develop a software architecture for detecting and recovering from faults in a cluster computing environment. The detection phase relies on the latest techniques developed in the fault tolerance community. Recovery is addressed in an application-dependent manner, allowing the programmer to take advantage of algorithmic characteristics to reduce the overhead of fault tolerance. This architecture will allow large-scale applications to be more robust in high-performance computing environments composed of clusters of commodity computers, such as CPlant and SMP clusters.
