Total recall: holistic metrics for broad systems performance and user experience visibility in a data-intensive computing environment

HUST '15 Pub Date : 2015-11-15 DOI:10.1145/2834996.2835001

Erich Birngruber, Petar Forai, Aaron Zauner

Citations: 5

Abstract

User support personnel, systems engineers, and administrators of HPC installations need to be aware of log and telemetry information from different systems in order to perform routine tasks ranging from systems management to answering user inquiries. We present an integrated, distributed monitoring system tailored to HPC, based on a current-generation software stack from the DevOps community, with integration into the workload management system. The goal of this system is to provide a quicker turnaround time for user inquiries in response to errors. Dashboards provide an overlay of system- and node-level events on top of correlated metrics data. This information is directly available for querying, manipulation, and filtering, allowing statistical analysis and aggregation of collected data. Furthermore, additional dashboards offer insight into how users are interacting with available resources and pinpoint fluctuations in utilization. The system can integrate sources of information from other monitoring solutions and event-based sources.
