Simulation Framework for Studying Optical Cable Failures in Dragonfly Topologies

Tiffany Connors, Taylor L. Groves, Tony Quan, K. Hemmert
Published in: 2019 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), 2019-05-20
DOI: 10.1109/IPDPSW.2019.00141 (https://doi.org/10.1109/IPDPSW.2019.00141)
Citations: 5

Abstract

In high performance computing (HPC) systems, optical links are widely used in the interconnect, but they fail at a relatively high rate compared to their electrical counterparts. Because of this high failure rate, evaluating the impact of link failures on HPC workloads is of particular interest. We extended the Merlin network module of the Structural Simulation Toolkit (SST) to evaluate the impact of link failures on applications running on HPC systems that use dragonfly network topologies. We focus on dragonfly topologies because they are frequently found in HPC systems, including the NERSC Cori and Edison systems. We demonstrate our changes to SST by providing a sample of performance results and routing statistics for a dragonfly network of 8,192 nodes and three HPC workloads with 1% of optical links failed. For the three motifs under consideration, we show that the impact of link failure depends largely on the workload running on the system.
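
The failure scenario described above can be illustrated with a minimal sketch. This is not the authors' SST/Merlin extension; it is a standalone Python illustration under stated assumptions: a canonical dragonfly in which every pair of groups is joined by exactly one optical (global) link, a hypothetical group count of 33 chosen only for illustration, and failures drawn uniformly at random. The function names `global_links` and `fail_links` are invented for this sketch.

```python
import itertools
import random


def global_links(num_groups):
    """Enumerate inter-group (optical) links, assuming one link per group pair."""
    return list(itertools.combinations(range(num_groups), 2))


def fail_links(links, fraction, seed=0):
    """Mark a uniformly random subset of links as failed.

    The number of failed links is round(fraction * len(links)).
    A fixed seed keeps the selection reproducible across runs.
    """
    rng = random.Random(seed)
    count = round(fraction * len(links))
    return set(rng.sample(links, count))


# Hypothetical network size; the abstract does not give the group count.
links = global_links(num_groups=33)
failed = fail_links(links, fraction=0.01)
surviving = [link for link in links if link not in failed]
print(f"{len(links)} global links, {len(failed)} failed, {len(surviving)} surviving")
```

A routing study would then feed the surviving link set to the simulator's topology description; how SST's Merlin module represents failed links is not stated in the abstract, so that step is left out here.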