Cray X1分布式共享内存架构的性能评估

Proceedings. 12th Annual IEEE Symposium on High Performance Interconnects Pub Date : 2004-08-05 DOI:10.1109/CONECT.2004.1375194

T. Dunigan, J. Vetter, P. Worley

{"title":"Cray X1分布式共享内存架构的性能评估","authors":"T. Dunigan, J. Vetter, P. Worley","doi":"10.1109/CONECT.2004.1375194","DOIUrl":null,"url":null,"abstract":"The Cray X1 supercomputer is a distributed shared memory vector multiprocessor, scalable to 4096 processors and up to 65 terabytes of memory. The X1's hierarchical design uses the basic building block of the multi-streaming processor (MSP), which is capable of 12.8 GF/s for 64-bit operations. The distributed shared memory (DSM) of the X1 presents a 64-bit global address space that is directly addressable from every MSP with an interconnect bandwidth per computation rate of one byte per floating point operation. Our results show that this high bandwidth and low latency for remote memory accesses translates into improved application performance on important applications, such as an Eulerian gyrokinetic-Maxwell solver. Furthermore, this architecture naturally supports programming models like the Cray shmem API, Unified Parallel C (UPC), and coarray FORTRAN (CAF), and it is imperative to select the appropriate models to exploit these features as our benchmarks demonstrate.","PeriodicalId":224195,"journal":{"name":"Proceedings. 12th Annual IEEE Symposium on High Performance Interconnects","volume":"35 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2004-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"56","resultStr":"{\"title\":\"Performance evaluation of the Cray X1 distributed shared memory architecture\",\"authors\":\"T. Dunigan, J. Vetter, P. Worley\",\"doi\":\"10.1109/CONECT.2004.1375194\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The Cray X1 supercomputer is a distributed shared memory vector multiprocessor, scalable to 4096 processors and up to 65 terabytes of memory. The X1's hierarchical design uses the basic building block of the multi-streaming processor (MSP), which is capable of 12.8 GF/s for 64-bit operations. The distributed shared memory (DSM) of the X1 presents a 64-bit global address space that is directly addressable from every MSP with an interconnect bandwidth per computation rate of one byte per floating point operation. Our results show that this high bandwidth and low latency for remote memory accesses translates into improved application performance on important applications, such as an Eulerian gyrokinetic-Maxwell solver. Furthermore, this architecture naturally supports programming models like the Cray shmem API, Unified Parallel C (UPC), and coarray FORTRAN (CAF), and it is imperative to select the appropriate models to exploit these features as our benchmarks demonstrate.\",\"PeriodicalId\":224195,\"journal\":{\"name\":\"Proceedings. 12th Annual IEEE Symposium on High Performance Interconnects\",\"volume\":\"35 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2004-08-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"56\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings. 12th Annual IEEE Symposium on High Performance Interconnects\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CONECT.2004.1375194\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings. 12th Annual IEEE Symposium on High Performance Interconnects","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CONECT.2004.1375194","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 56

摘要

克雷X1超级计算机是一个分布式共享内存矢量多处理器，可扩展到4096个处理器和高达65tb的内存。X1的分层设计使用了多流处理器(MSP)的基本构建块，其64位操作的速度为12.8 GF/s。X1的分布式共享内存(DSM)提供了一个64位全局地址空间，可以从每个MSP直接寻址，每个浮点操作的每个计算率为一个字节的互连带宽。我们的结果表明，远程内存访问的高带宽和低延迟转化为重要应用程序的应用程序性能的改进，例如欧拉陀螺仪-麦克斯韦求解器。此外，这种体系结构自然地支持编程模型，如Cray shmem API、统一并行C (UPC)和coarray FORTRAN (CAF)，并且必须选择合适的模型来利用我们的基准测试所展示的这些特性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Performance evaluation of the Cray X1 distributed shared memory architecture

The Cray X1 supercomputer is a distributed shared memory vector multiprocessor, scalable to 4096 processors and up to 65 terabytes of memory. The X1's hierarchical design uses the basic building block of the multi-streaming processor (MSP), which is capable of 12.8 GF/s for 64-bit operations. The distributed shared memory (DSM) of the X1 presents a 64-bit global address space that is directly addressable from every MSP with an interconnect bandwidth per computation rate of one byte per floating point operation. Our results show that this high bandwidth and low latency for remote memory accesses translates into improved application performance on important applications, such as an Eulerian gyrokinetic-Maxwell solver. Furthermore, this architecture naturally supports programming models like the Cray shmem API, Unified Parallel C (UPC), and coarray FORTRAN (CAF), and it is imperative to select the appropriate models to exploit these features as our benchmarks demonstrate.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings. 12th Annual IEEE Symposium on High Performance Interconnects

自引率

0.00%

发文量