{"title":"Performance evaluation of the Cray X1 distributed shared memory architecture","authors":"T. Dunigan, J. Vetter, P. Worley","doi":"10.1109/CONECT.2004.1375194","journal":{"name":"Proceedings. 12th Annual IEEE Symposium on High Performance Interconnects"},"publicationDate":"2004-08-05","publicationTypes":"Journal Article","isOpenAccess":false,"citationCount":"56","platform":"Semanticscholar","ListUrlMain":"https://doi.org/10.1109/CONECT.2004.1375194"}
Performance evaluation of the Cray X1 distributed shared memory architecture
The Cray X1 supercomputer is a distributed shared memory vector multiprocessor, scalable to 4096 processors and up to 65 terabytes of memory. The X1's hierarchical design uses the multi-streaming processor (MSP) as its basic building block; each MSP is capable of 12.8 GF/s for 64-bit operations. The distributed shared memory (DSM) of the X1 presents a 64-bit global address space that is directly addressable from every MSP, with a ratio of interconnect bandwidth to computation rate of one byte per floating point operation. Our results show that this high bandwidth and low latency for remote memory accesses translate into improved performance on important applications, such as an Eulerian gyrokinetic-Maxwell solver. Furthermore, this architecture naturally supports programming models such as the Cray shmem API, Unified Parallel C (UPC), and Co-Array Fortran (CAF), and our benchmarks demonstrate that selecting the appropriate model is imperative for exploiting these features.
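
The figures quoted in the abstract imply a few derived numbers, which the following minimal sketch works out. It uses only the values stated above. It assumes that "processors" means MSPs and that units are decimal (1 TF/s = 1000 GF/s); neither is confirmed by the abstract. The function names are hypothetical, and the results are peak values, not measured performance.

```python
# Back-of-the-envelope figures derived only from numbers quoted in the abstract.
# Assumptions (not confirmed by the abstract): "processors" means MSPs,
# and units are decimal (1 TF/s = 1000 GF/s).

PEAK_GFLOPS_PER_MSP = 12.8  # stated: 12.8 GF/s per MSP for 64-bit operations
BYTES_PER_FLOP = 1.0        # stated: one byte of interconnect bandwidth per flop
MAX_PROCESSORS = 4096       # stated: scalable to 4096 processors


def interconnect_gb_per_s(peak_gflops: float, bytes_per_flop: float) -> float:
    """Interconnect bandwidth per MSP (GB/s) implied by the byte-per-flop ratio."""
    return peak_gflops * bytes_per_flop


def aggregate_peak_tflops(peak_gflops: float, processors: int) -> float:
    """Aggregate peak rate (TF/s) if every processor ran at the per-MSP peak."""
    return peak_gflops * processors / 1000.0


if __name__ == "__main__":
    bw = interconnect_gb_per_s(PEAK_GFLOPS_PER_MSP, BYTES_PER_FLOP)
    peak = aggregate_peak_tflops(PEAK_GFLOPS_PER_MSP, MAX_PROCESSORS)
    print(f"Implied interconnect bandwidth per MSP: {bw:.1f} GB/s")
    print(f"Implied aggregate peak at {MAX_PROCESSORS} processors: {peak:.1f} TF/s")
```

Under these assumptions, the one-byte-per-flop ratio corresponds to roughly 12.8 GB/s of interconnect bandwidth per MSP, and a full 4096-processor configuration to a peak of about 52.4 TF/s.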