Initial Proposal for MPI 3.0 Error Handling
Bronevetsky, G
Keywords: PROCESSING; SPECIFICATIONS; TOLERANCE
DOI: 10.2172/945669 | RP-ID: LLNL-TR-405242 | PID: OSTI ID: 945669 | Others: TRN: US200904%%120
Subject classification: Social Sciences, Humanities and Arts (General)
United States | English
Source: SciTech Connect
【 Abstract 】
The MPI-2 specification contains error handling and notification mechanisms that have a number of limitations from the point of view of application fault tolerance: (1) The specification makes no demands on MPI to survive failures. Although MPI implementers are encouraged to 'circumscribe the impact of an error, so that normal processing can continue after an error handler was invoked', nothing more is specified in the standard. In particular, the defined MPI error classes are used only to clarify to the user the source of the error and do not describe the MPI functionality that is not available as a result of the error. (2) All errors must somehow be associated with some specific MPI call. As such, (A) it is difficult for MPI to notify users of failures in asynchronous calls, such as an MPI_Rsend call, which may return immediately after the message data is sent along the wire but before it is successfully delivered; (B) there is no provision for asynchronous error notification regarding errors that will affect future calls, such as notifying process p of the failure of process q before p tries to communicate with q. (3) There is no description of when error notification will happen relative to the occurrence of the error. In particular, the specification does not state whether an error that would cause MPI functions to return an error code under the MPI_ERRORS_RETURN error handler would cause a user-defined error handler to be called during the same MPI function or at some earlier or later point in time. (4) Although MPI makes it possible for libraries to define their own error classes and invoke application error handlers, it is not possible for the application to define new error notification patterns either within or across processes. This means that it is not possible for one application process to ask to be informed of errors on other processes or for the application to be informed of specific classes of errors.
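Limitation (2B) can be made concrete with a small toy model. The sketch below is plain Python and does not use MPI or any MPI binding: `ToyComm`, `fail`, `send`, and the two `TOY_*` codes are hypothetical names introduced only for illustration. It assumes a model in which an error can reach the application only as the outcome of a specific call, either as a returned code (loosely analogous to running under MPI_ERRORS_RETURN) or through a user handler invoked during that call.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Set

# Hypothetical result codes for this toy model; they are not MPI error classes.
TOY_SUCCESS = 0
TOY_ERR_PEER_FAILED = 1


@dataclass
class ToyComm:
    """Toy stand-in for a communicator; this is not an MPI binding."""

    size: int
    # Optional user handler called as handler(dest, code).
    # None models "just return the error code to the caller".
    handler: Optional[Callable[[int, int], None]] = None
    failed: Set[int] = field(default_factory=set)
    log: List[str] = field(default_factory=list)

    def fail(self, rank: int) -> None:
        """Record that a rank has died. Nobody is notified at this point."""
        self.failed.add(rank)
        self.log.append(f"rank {rank} failed (no notification)")

    def send(self, dest: int) -> int:
        """Attempt a send; a peer failure surfaces only as this call's outcome."""
        if dest in self.failed:
            if self.handler is not None:
                # In this model the handler runs during the failing call.
                self.handler(dest, TOY_ERR_PEER_FAILED)
            self.log.append(f"send to {dest}: error reported")
            return TOY_ERR_PEER_FAILED
        self.log.append(f"send to {dest}: ok")
        return TOY_SUCCESS


if __name__ == "__main__":
    notifications = []
    comm = ToyComm(size=2, handler=lambda d, c: notifications.append((d, c)))
    comm.send(1)   # peer is alive
    comm.fail(1)   # peer dies; the handler is not invoked here
    comm.send(1)   # only now does the sender learn of the failure
    for entry in comm.log:
        print(entry)
```

In this model, process p learns that q has failed only when it next tries to communicate with q, which is the gap item (2B) describes. The model also fixes the handler to run inside the failing call; item (3) notes that the specification itself does not state this timing.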