Performance Analysis of Cell Broadband Engine for High Memory Bandwidth Applications

Daniel Jiménez-González, X. Martorell, Alex Ramírez
Venue: 2007 IEEE International Symposium on Performance Analysis of Systems & Software
Published: 2007-04-25
DOI: 10.1109/ISPASS.2007.363751
URL: https://doi.org/10.1109/ISPASS.2007.363751
Citations: 32

Abstract

The Cell Broadband Engine (CBE) is designed to be a general-purpose platform offering enormous arithmetic performance through its eight SIMD-only synergistic processor elements (SPEs), capable of achieving 134.4 GFLOPS (16.8 GFLOPS * 8) at 2.1 GHz, and a 64-bit Power processor element (PPE). Each SPE has a 256 KB non-coherent local memory and communicates with other SPEs and main memory through its DMA controller. CBE main memory is connected to all the CBE processor elements (PPE and SPEs) through the element interconnect bus (EIB), which has a peak bandwidth of 134.4 GB/s at half the processor speed. Therefore, the CBE platform is suitable for applications using MPI and streaming programming models, with a potentially high peak performance. In this paper we focus on the communication part of those applications and measure the actual memory bandwidth that each of the CBE processor components can sustain. We have measured the sustained bandwidth between PPE and memory; between SPE and memory; between two individual SPEs, to determine whether this bandwidth depends on their physical location; between pairs of SPEs, to achieve maximum bandwidth in nearly ideal conditions; and in a cycle of SPEs representing a streaming kind of computation. Our results on a real machine show that, following some strict programming rules, individual SPE-to-SPE communication almost achieves the peak bandwidth when using the DMA controllers to transfer memory chunks of at least 1024 bytes. In addition, SPE-to-memory bandwidth should be considered in streaming programming. For instance, implementing two data streams using 4 SPEs each can be more efficient than having a single data stream using all 8 SPEs.
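
The peak figures quoted in the abstract can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not code from the paper: the function names are hypothetical, and the per-SPE breakdown (4 single-precision SIMD lanes, each retiring one fused multiply-add, counted as 2 flops, per cycle) is an assumption that the abstract does not state. The EIB calculation uses only the abstract's own numbers.

```python
# Figures quoted in the abstract.
CLOCK_GHZ = 2.1          # processor clock
NUM_SPES = 8             # synergistic processor elements
GFLOPS_PER_SPE = 16.8    # peak per SPE
EIB_PEAK_GB_S = 134.4    # peak EIB bandwidth

# Assumptions (not stated in the abstract): 4 single-precision lanes,
# one fused multiply-add (2 flops) per lane per cycle.
SIMD_LANES = 4
FLOPS_PER_LANE_PER_CYCLE = 2


def aggregate_peak_gflops(num_spes=NUM_SPES, per_spe=GFLOPS_PER_SPE):
    """Aggregate SPE peak: number of SPEs times the per-SPE peak."""
    return num_spes * per_spe


def per_spe_peak_gflops(clock_ghz=CLOCK_GHZ):
    """Per-SPE peak under the lane/FMA assumption above."""
    return SIMD_LANES * FLOPS_PER_LANE_PER_CYCLE * clock_ghz


def eib_bytes_per_bus_cycle(peak_gb_s=EIB_PEAK_GB_S, clock_ghz=CLOCK_GHZ):
    """Bytes moved per EIB cycle, with the bus at half the processor clock."""
    bus_ghz = clock_ghz / 2
    return peak_gb_s / bus_ghz


if __name__ == "__main__":
    print(f"aggregate peak: {aggregate_peak_gflops():.1f} GFLOPS")
    print(f"per-SPE peak:   {per_spe_peak_gflops():.1f} GFLOPS")
    print(f"EIB peak:       {eib_bytes_per_bus_cycle():.0f} bytes per bus cycle")
```

Under these assumptions the per-SPE figure reproduces the 16.8 GFLOPS quoted above, and the EIB peak works out to 128 bytes per bus cycle at 1.05 GHz.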