Performance optimization of load imbalanced workloads in large scale Dragonfly systems

B. Prisacari, G. Rodríguez, C. Minkenberg, Marina García, E. Vallejo, R. Beivide
2015 IEEE 16th International Conference on High Performance Switching and Routing (HPSR), published 2015-07-01. DOI: 10.1109/HPSR.2015.7483107 (https://doi.org/10.1109/HPSR.2015.7483107)
Citations: 6

Abstract

Dragonfly topologies are among the most promising interconnect designs for enabling large, potentially exascale compute systems, particularly those envisioned to accommodate workloads that are sensitive to system diameter and end-to-end latency. They are cost-effective designs with a very low diameter and close-to-optimal performance for workloads that induce a balanced load across the network. However, these benefits are offset by reduced path diversity, which leaves Dragonflies vulnerable to certain adversarial traffic patterns. The performance of such workloads can be significantly improved using indirect routing approaches. However, the indirect routing approach most commonly used today is in turn significantly vulnerable to a subset of these traffic patterns, for reasons that have not, up to now, been entirely understood. In exploring this vulnerability, we provide a theoretical justification, based on inherent properties of the Dragonfly topology, of why performance degrades. Furthermore, we isolate what specifically in the structure of a traffic pattern makes it a worst case in this context, and are thus able to characterize the precise workload subset that will experience poor performance. Building on this understanding of the interaction that causes sub-optimal behavior, we then show how simple changes to either the routing strategy or the process-to-node assignment can bring performance back close to ideal levels. Finally, we validate our theoretical performance models via comprehensive simulation-based studies of systems with up to 16,512 nodes.
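
To give a sense of the scale quoted above, the following minimal sketch computes the size of a Dragonfly under the balanced sizing rule commonly used in the literature (a = 2p = 2h, with exactly one global link between every pair of groups). The abstract does not state which parameters the authors used; the function name `balanced_dragonfly` and the choice h = 8 are assumptions made here purely for illustration, chosen because that configuration yields 16,512 nodes.

```python
def balanced_dragonfly(h: int) -> dict:
    """Size of a maximum-size balanced Dragonfly (illustrative assumption).

    h: number of global (inter-group) links per router.
    Assumes the common balanced rule a = 2p = 2h, which is not
    confirmed by the abstract.
    """
    p = h                 # compute nodes attached to each router
    a = 2 * h             # routers per group, fully connected locally
    groups = a * h + 1    # each group has a*h global links, one per other group
    routers = a * groups
    nodes = p * routers
    return {
        "nodes_per_router": p,
        "routers_per_group": a,
        "groups": groups,
        "routers": routers,
        "nodes": nodes,
    }


if __name__ == "__main__":
    size = balanced_dragonfly(8)
    print(size["groups"], size["routers"], size["nodes"])
```

With h = 8 the sketch gives 129 groups of 16 routers each, that is, 2,064 routers and 16,512 nodes. Because each pair of groups shares a single global link in this configuration, a minimal path between groups has only one choice of global link, which is the reduced path diversity the abstract refers to.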