Moneo: Monitoring Fine-grained Metrics Nonintrusively in AI Infrastructure

Q3 Computer Science
Yuting Jiang, Yifan Xiong, L. Qu, Cheng Luo, Chen Tian, Peng Cheng, Y. Xiong
DOI: 10.1145/3544497.3544501 (https://doi.org/10.1145/3544497.3544501)
Journal: Operating Systems Review (ACM), vol. 56, no. 1, pp. 18-25
Publication date: 2022-01-01
Publication type: Journal Article
Citations: 2

Abstract

Cloud-based AI infrastructure is becoming increasingly important, especially for large-scale distributed training. Real-time monitoring of the infrastructure and workload profiling have proved empirically to be effective ways to improve its efficiency and serviceability. However, the cloud environment poses great challenges: service providers cannot interfere with their tenants' workloads or touch user data, so previous instrumentation-based monitoring approaches cannot be applied, nor can workload traces be collected. In this paper, we propose Moneo, a non-intrusive, cloud-friendly monitoring system for AI infrastructure. Moneo intelligently collects key architecture-level metrics at finer granularity in real time without instrumenting or tracing the workloads, and it has been deployed in a real production cloud, Azure. We analyze the results reported by Moneo for typical large-scale distributed AI workloads from real deployment. The results demonstrate that Moneo can effectively help service providers understand the real resource usage patterns and networking requirements of various AI workloads, yielding findings that help improve the efficiency of cloud infrastructure and optimize the software stack for the characteristic resource usage requirements of different AI workloads.

This is a revised version of the symposium paper [23] originally presented at IEEE ICC 2022.
Source journal
Operating Systems Review (ACM)
Subject area: Computer Science, Computer Networks and Communications
CiteScore: 2.80
Self-citation rate: 0.00%
Articles published: 10
About the journal: Operating Systems Review (OSR) is a publication of the ACM Special Interest Group on Operating Systems (SIGOPS), whose scope of interest includes: computer operating systems and architecture for multiprogramming, multiprocessing, and time sharing; resource management; evaluation and simulation; reliability, integrity, and security of data; communications among computing processors; and computer system modeling and analysis.