{"title":"RAS Modeling of an HPC Switch System","authors":"D. Tang, William Bryson, Richard Elling","doi":"10.1109/PRDC.2008.19","journal":{"name":"2008 14th IEEE Pacific Rim International Symposium on Dependable Computing","volume":"142","pages":"0"},"publicationDate":"2008-12-15","publicationTypes":"Journal Article","isOpenAccess":false,"citationCount":"0","platform":"Semanticscholar","ListUrlMain":"https://doi.org/10.1109/PRDC.2008.19"}
The high end of high performance computing (HPC) systems is now moving toward petascale deployments, delivering petaflops of computational capacity and petabytes of storage capacity. Interconnecting the large number of server nodes in an HPC system plays a vital role in these developments. InfiniBand has emerged as a compelling interconnect technology, providing more scalability and significantly better cost-performance than any other known protocol. This paper presents reliability, availability, and serviceability (RAS) modeling and analysis, against hardware faults, of the Sun Datacenter Switch 3456 system, the world's largest standards-based InfiniBand switch, with direct capacity to host up to 3,456 server nodes. The results show that system reliability, measured as connectivity between the server nodes physically connected to the switch, is high for configurations with redundant ports. The study also shows that deferred repair strategies can significantly reduce unscheduled service events and system downtime. Further, the study identifies optimal service strategies through a tradeoff analysis of reliability and availability.