Practical Resource Monitoring for Robust High Throughput Computing

G. Juve, Benjamín Tovar, Rafael Ferreira da Silva, Dariusz Król, D. Thain, E. Deelman, W. Allcock, M. Livny
{"title":"Practical Resource Monitoring for Robust High Throughput Computing","authors":"G. Juve, Benjamín Tovar, Rafael Ferreira da Silva, Dariusz Król, D. Thain, E. Deelman, W. Allcock, M. Livny","doi":"10.1109/CLUSTER.2015.115","DOIUrl":null,"url":null,"abstract":"Robust high throughput computing requires effective monitoring and enforcement of a variety of resources including CPU cores, memory, disk, and network traffic. Without effective monitoring and enforcement, it is easy to overload machines, causing failures and slowdowns, or underutilize machines, which results in wasted opportunities. This paper explores how to describe, measure, and enforce resources used by computational tasks. We focus on tasks running in distributed execution systems, in which a task requests the resources it needs, and the execution system ensures the availability of such resources. This presents two non-trivial problems: how to measure the resources consumed by a task, and how to monitor and report resource exhaustion in a robust and timely manner. For both of these tasks, operating systems have a variety of mechanisms with different degrees of availability, accuracy, overhead, and intrusiveness. We describe various forms of monitoring and the available mechanisms in contemporary operating systems. We then present two specific monitoring tools that choose different tradeoffs in overhead and accuracy, and evaluate them on a selection of benchmarks.","PeriodicalId":187042,"journal":{"name":"2015 IEEE International Conference on Cluster Computing","volume":"110 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 IEEE International Conference on Cluster Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CLUSTER.2015.115","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 29

Abstract

Robust high throughput computing requires effective monitoring and enforcement of a variety of resources including CPU cores, memory, disk, and network traffic. Without effective monitoring and enforcement, it is easy to overload machines, causing failures and slowdowns, or underutilize machines, which results in wasted opportunities. This paper explores how to describe, measure, and enforce resources used by computational tasks. We focus on tasks running in distributed execution systems, in which a task requests the resources it needs, and the execution system ensures the availability of such resources. This presents two non-trivial problems: how to measure the resources consumed by a task, and how to monitor and report resource exhaustion in a robust and timely manner. For both of these tasks, operating systems have a variety of mechanisms with different degrees of availability, accuracy, overhead, and intrusiveness. We describe various forms of monitoring and the available mechanisms in contemporary operating systems. We then present two specific monitoring tools that choose different tradeoffs in overhead and accuracy, and evaluate them on a selection of benchmarks.
鲁棒高吞吐量计算的实用资源监控
健壮的高吞吐量计算需要有效地监控和执行各种资源,包括CPU内核、内存、磁盘和网络流量。如果没有有效的监控和执行,很容易使机器过载,导致故障和减速,或者机器利用率不足,从而导致浪费机会。本文探讨了如何描述、度量和强制执行计算任务所使用的资源。我们关注在分布式执行系统中运行的任务,其中任务请求它所需的资源,执行系统确保这些资源的可用性。这就提出了两个重要的问题:如何度量任务消耗的资源,以及如何以可靠和及时的方式监视和报告资源耗尽情况。对于这两项任务,操作系统具有各种不同程度的可用性、准确性、开销和侵入性的机制。我们描述了各种形式的监视和当代操作系统中可用的机制。然后,我们介绍了两个特定的监视工具,它们在开销和准确性方面选择了不同的权衡,并在一系列基准测试中对它们进行了评估。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信