Technical Report Details
OVIS 2.0 user's guide.
Mayo, Jackson R. ; Gentile, Ann C. ; Brandt, James M. ; Thompson, David C. ; Roe, Diana C. ; Wong, Matthew H. ; Pebay, Philippe Pierre
Keywords: APPLIANCES; COMPUTERS; COMPUTER CODES; PROGRAMMING; ENGINEERS; MONITORING; PERFORMANCE; STORAGE; SWITCHES; TARGETS
DOI  :  10.2172/1028957
RP-ID  :  SAND2009-2329
PID  :  OSTI ID: 1028957
Others  :  TRN: US201201%%68
Subject classification: Social Sciences, Humanities and Arts (General)
United States | English
Source: SciTech Connect
【 Abstract 】

This document describes how to obtain, install, use, and enjoy a better life with OVIS version 2.0. The OVIS project targets scalable, real-time analysis of very large data sets. We characterize the behaviors of elements and aggregations of elements (e.g., across space and time) in data sets in order to detect anomalous behaviors. We are particularly interested in determining anomalous behaviors that can be used as advance indicators of significant events of which notification can be made or upon which action can be taken or invoked. The OVIS open source tool (BSD license) is available for download at ovis.ca.sandia.gov. While we intend for it to support a variety of application domains, the OVIS tool was initially developed for, and continues to be primarily tuned for, the investigation of High Performance Compute (HPC) cluster system health. In this application it is intended to be both a system administrator tool for monitoring and a system engineer tool for exploring the system state in depth. OVIS 2.0 provides a variety of statistical tools for examining the behavior of elements in a cluster (e.g., nodes, racks) and associated resources (e.g., storage appliances and network switches). It calculates and reports model values and outliers relative to those models. Additionally, it provides an interactive 3D physical view in which the cluster elements can be colored by raw element values (e.g., temperatures, memory errors) or by the comparison of those values to a given model. The analysis tools and the visual display allow the user to easily determine abnormal or outlier behaviors. The OVIS project envisions the OVIS tool, when applied to compute cluster monitoring, to be used in conjunction with the scheduler or resource manager in order to enable intelligent resource utilization. 
For example, nodes that are deemed less healthy, that is, nodes that exhibit outlier behavior in some variable, or set of variables, that has been shown to be correlated with future failure, can be discovered and assigned to shorter duration or less important jobs. Further, applications with fault-tolerant capabilities can invoke those mechanisms on demand, based upon notification of a node exhibiting impending failure conditions, rather than performing such mechanisms (e.g., checkpointing) at regular intervals unnecessarily.
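
The idea of reporting "outliers relative to a model" can be illustrated with a minimal sketch. This is not OVIS code and does not use any OVIS API: the function name find_outliers, the node names, the temperature values, and the threshold are all hypothetical, and the "model" here is assumed to be nothing more than the mean and sample standard deviation across nodes.

```python
from statistics import mean, stdev


def find_outliers(readings, threshold=2.0):
    """Return the names of nodes whose reading lies more than `threshold`
    sample standard deviations from the mean of all readings.

    `readings` maps a node name to one numeric value (e.g. a temperature).
    Hypothetical helper for illustration only; not part of OVIS.
    """
    values = list(readings.values())
    mu = mean(values)
    sigma = stdev(values)  # sample standard deviation; needs >= 2 values
    if sigma == 0:
        return []  # all readings identical, so nothing can be an outlier
    return sorted(
        name for name, value in readings.items()
        if abs(value - mu) / sigma > threshold
    )


# Made-up temperatures (degrees C) for eight nodes; node08 runs hot.
temps = {
    "node01": 41.0, "node02": 42.5, "node03": 40.8, "node04": 41.7,
    "node05": 42.1, "node06": 41.3, "node07": 40.9, "node08": 63.0,
}
suspect = find_outliers(temps)
print(suspect)
```

In the scenario the abstract describes, a list like suspect is the kind of information a scheduler or resource manager could use to steer shorter or less important jobs toward the flagged nodes. A single extreme value inflates the standard deviation it is measured against, so a production model would likely be more robust than this one.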
