RAS Modeling of an HPC Switch System

D. Tang, William Bryson, Richard Elling
Published in: 2008 14th IEEE Pacific Rim International Symposium on Dependable Computing
Publication date: 2008-12-15
DOI: 10.1109/PRDC.2008.19 (https://doi.org/10.1109/PRDC.2008.19)
Citations: 0

Abstract

High-end high performance computing (HPC) systems are moving toward petascale deployments that deliver petaflops of computational capacity and petabytes of storage capacity. Interconnecting the sheer number of server nodes in such a system plays a vital role in this development. InfiniBand has emerged as a compelling interconnect technology, and provides more scalability and significantly better cost-performance than any other known protocol. This paper presents a reliability, availability, and serviceability (RAS) model and analysis, against hardware faults, of the Sun Datacenter Switch 3456 system, the world's largest standards-based InfiniBand switch, with direct capacity to host up to 3,456 server nodes. The results show that system reliability, measured as connectivity between the server nodes physically connected to the switch, is high for configurations with redundant ports. The study also shows that deferred repair strategies can significantly reduce unscheduled service events and system downtime. Finally, a tradeoff analysis of reliability and availability identifies optimal service strategies.
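The abstract's point about redundant ports can be illustrated with a generic k-out-of-n reliability calculation. This is only a minimal sketch under stated assumptions, not the model used in the paper: it assumes independent, identical components with a constant failure rate, and the MTBF and mission time below are hypothetical values chosen for illustration.

```python
import math


def component_reliability(failure_rate_per_hour: float, hours: float) -> float:
    """R(t) = exp(-lambda * t) for a component with a constant failure rate."""
    return math.exp(-failure_rate_per_hour * hours)


def k_out_of_n_reliability(k: int, n: int, r: float) -> float:
    """Probability that at least k of n independent, identical components work."""
    return sum(
        math.comb(n, i) * r**i * (1.0 - r) ** (n - i)
        for i in range(k, n + 1)
    )


# Hypothetical inputs (assumptions, not figures from the paper).
MTBF_HOURS = 500_000.0   # assumed mean time between failures of one port path
MISSION_HOURS = 8_760.0  # one year of continuous operation

r = component_reliability(1.0 / MTBF_HOURS, MISSION_HOURS)
single_port = k_out_of_n_reliability(1, 1, r)     # no redundancy
redundant_port = k_out_of_n_reliability(1, 2, r)  # one of two ports suffices

print(f"single port:    {single_port:.6f}")
print(f"redundant port: {redundant_port:.6f}")
```

Under these assumptions the redundant configuration fails only when both ports fail within the mission time, which is why its reliability exceeds that of the single-port case. A study of deferred repair would additionally need a repair model (for example a Markov chain), which this sketch does not attempt.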