The Efficient In-band Management for Interconnect Network in Tianhe-2 System

Jijun Cao, Liquan Xiao, Zhengbin Pang, Kefei Wang, Jiaqing Xu
{"title":"The Efficient In-band Management for Interconnect Network in Tianhe-2 System","authors":"Jijun Cao, Liquan Xiao, Zhengbin Pang, Kefei Wang, Jiaqing Xu","doi":"10.1109/PDP.2016.58","DOIUrl":null,"url":null,"abstract":"Interconnect network plays an important role in high performance computing systems. And its manageability directly affects the RAS (i.e., Reliability, Availability, and Serviceability) of the whole system. The Tianhe-2 system located in NSCC-gz (i.e., National Supercomputing Center of China in Guangzhou) uses proprietary interconnect network, which includes 5,856 high-radix network router chips (i.e., NRC) and 18,304 network interface chips (i.e., NIC). For such a very large-scale interconnect network, it is a great challenge to manage (such as configure, monitor, and debug) the numerous network chips and its network ports in an efficient way. By implementing the in-band management with very few hardware resources, the interconnect network in Tianhe-2 system achieves a highly efficient network management. In this paper, we introduce the design and implementation of the in-band management for interconnect network in Tianhe-2 system, especially emphasizing on several key features, including the set of achieved management functionalities, the architecture of network management, the format of management packets, the data flow and processing of management packets, etc. In this paper, we also evaluate the performance of in-band management by mainly comparing with out-band management scheme. The preliminary results demonstrate the efficiency of the in-band management for interconnect network in Tianhe-2 system.","PeriodicalId":192273,"journal":{"name":"2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 24th Euromicro International Conference on Parallel, Distributed, and Network-Based Processing (PDP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PDP.2016.58","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

Abstract

Interconnect network plays an important role in high performance computing systems. And its manageability directly affects the RAS (i.e., Reliability, Availability, and Serviceability) of the whole system. The Tianhe-2 system located in NSCC-gz (i.e., National Supercomputing Center of China in Guangzhou) uses proprietary interconnect network, which includes 5,856 high-radix network router chips (i.e., NRC) and 18,304 network interface chips (i.e., NIC). For such a very large-scale interconnect network, it is a great challenge to manage (such as configure, monitor, and debug) the numerous network chips and its network ports in an efficient way. By implementing the in-band management with very few hardware resources, the interconnect network in Tianhe-2 system achieves a highly efficient network management. In this paper, we introduce the design and implementation of the in-band management for interconnect network in Tianhe-2 system, especially emphasizing on several key features, including the set of achieved management functionalities, the architecture of network management, the format of management packets, the data flow and processing of management packets, etc. In this paper, we also evaluate the performance of in-band management by mainly comparing with out-band management scheme. The preliminary results demonstrate the efficiency of the in-band management for interconnect network in Tianhe-2 system.
天河二号系统互联网络的高效带内管理
互连网络在高性能计算系统中起着重要的作用。其可管理性直接影响整个系统的RAS(即可靠性、可用性和可服务性)。天河二号系统位于广州NSCC-gz(即中国国家超级计算中心),采用专有互连网络,其中包括5856个高基数网络路由器芯片(即NRC)和18304个网络接口芯片(即NIC)。对于这样一个非常大规模的互连网络,如何有效地管理(如配置、监控和调试)众多的网络芯片及其网络端口是一个巨大的挑战。天河二号系统互联网络通过极少的硬件资源实现带内管理,实现了高效的网络管理。本文介绍了天河二号系统互联网络带内管理的设计与实现,重点介绍了实现的管理功能集、网络管理体系结构、管理数据包格式、管理数据包的数据流和处理等几个关键特点。本文还通过与带外管理方案的比较,对带内管理方案的性能进行了评价。初步结果验证了天河二号系统互联网络带内管理的有效性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信