面向高存储带宽应用的小区宽带引擎性能分析

2007 IEEE International Symposium on Performance Analysis of Systems & Software Pub Date : 2007-04-25 DOI:10.1109/ISPASS.2007.363751

Daniel Jiménez-González, X. Martorell, Alex Ramírez

{"title":"面向高存储带宽应用的小区宽带引擎性能分析","authors":"Daniel Jiménez-González, X. Martorell, Alex Ramírez","doi":"10.1109/ISPASS.2007.363751","DOIUrl":null,"url":null,"abstract":"The cell broadband engine (CBE) is designed to be a general purpose platform exposing an enormous arithmetic performance due to its eight SIMD-only synergistic processor elements (SPEs), capable of achieving 134.4 GFLOPS (16.8 GFLOPS * 8) at 2.1 GHz, and a 64-bit power processor element (PPE). Each SPE has a 256Kb non-coherent local memory, and communicates to other SPEs and main memory through its DMA controller. CBE main memory is connected to all the CBE processor elements (PPE and SPEs) through the element interconnect bus (EIB), which has a 134.4 GB/s bandwidth performance peak at half the processor speed. Therefore, CBE platform is suitable to be used by applications using MPI and streaming programming models with a potential high performance peak. In this paper we focus on the communication part of those applications, and measure the actual memory bandwidth that each of the CBE processor components can sustain. We have measured the sustained bandwidth between PPE and memory, SPE and memory, two individual SPEs to determine if this bandwidth depends on their physical location, pairs of SPEs to achieve maximum bandwidth in nearly-ideal conditions, and in a cycle of SPEs representing a streaming kind of computation. Our results on a real machine show that following some strict programming rules, individual SPE to SPE communication almost achieves the peak bandwidth when using the DMA controllers to transfer memory chunks of at least 1024 Bytes. In addition, SPE to memory bandwidth should be considered in streaming programming. For instance, implementing two data streams using 4 SPEs each can be more efficient than having a single data stream using the 8 SPEs","PeriodicalId":439151,"journal":{"name":"2007 IEEE International Symposium on Performance Analysis of Systems & Software","volume":"36 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2007-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":"{\"title\":\"Performance Analysis of Cell Broadband Engine for High Memory Bandwidth Applications\",\"authors\":\"Daniel Jiménez-González, X. Martorell, Alex Ramírez\",\"doi\":\"10.1109/ISPASS.2007.363751\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The cell broadband engine (CBE) is designed to be a general purpose platform exposing an enormous arithmetic performance due to its eight SIMD-only synergistic processor elements (SPEs), capable of achieving 134.4 GFLOPS (16.8 GFLOPS * 8) at 2.1 GHz, and a 64-bit power processor element (PPE). Each SPE has a 256Kb non-coherent local memory, and communicates to other SPEs and main memory through its DMA controller. CBE main memory is connected to all the CBE processor elements (PPE and SPEs) through the element interconnect bus (EIB), which has a 134.4 GB/s bandwidth performance peak at half the processor speed. Therefore, CBE platform is suitable to be used by applications using MPI and streaming programming models with a potential high performance peak. In this paper we focus on the communication part of those applications, and measure the actual memory bandwidth that each of the CBE processor components can sustain. We have measured the sustained bandwidth between PPE and memory, SPE and memory, two individual SPEs to determine if this bandwidth depends on their physical location, pairs of SPEs to achieve maximum bandwidth in nearly-ideal conditions, and in a cycle of SPEs representing a streaming kind of computation. Our results on a real machine show that following some strict programming rules, individual SPE to SPE communication almost achieves the peak bandwidth when using the DMA controllers to transfer memory chunks of at least 1024 Bytes. In addition, SPE to memory bandwidth should be considered in streaming programming. For instance, implementing two data streams using 4 SPEs each can be more efficient than having a single data stream using the 8 SPEs\",\"PeriodicalId\":439151,\"journal\":{\"name\":\"2007 IEEE International Symposium on Performance Analysis of Systems & Software\",\"volume\":\"36 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2007-04-25\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"32\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2007 IEEE International Symposium on Performance Analysis of Systems & Software\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ISPASS.2007.363751\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2007 IEEE International Symposium on Performance Analysis of Systems & Software","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISPASS.2007.363751","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 32

摘要

蜂窝宽带引擎(CBE)被设计成一个通用平台，由于其8个仅simd的协同处理器元件(spe)，能够在2.1 GHz下实现134.4 GFLOPS (16.8 GFLOPS * 8)，以及一个64位功率处理器元件(PPE)，从而暴露出巨大的算术性能。每个SPE具有256Kb的非相干本地内存，并通过其DMA控制器与其他SPE和主内存通信。CBE主存通过元素互连总线(EIB)连接到所有CBE处理器单元(PPE和spe)，在处理器速度的一半下，其带宽性能峰值为134.4 GB/s。因此，CBE平台适用于使用MPI和流编程模型的应用程序，这些应用程序具有潜在的高性能峰值。在本文中，我们将重点关注这些应用程序的通信部分，并测量每个CBE处理器组件可以承受的实际内存带宽。我们测量了PPE和内存之间的持续带宽、SPE和内存之间的持续带宽、两个单独的SPE之间的持续带宽，以确定该带宽是否依赖于它们的物理位置、在近乎理想的条件下实现最大带宽的对SPE，以及在一个表示流计算的SPE周期中。我们在真实机器上的结果表明，遵循一些严格的编程规则，当使用DMA控制器传输至少1024字节的内存块时，单个SPE到SPE通信几乎可以达到峰值带宽。此外，在流编程中应该考虑SPE对内存带宽的占用。例如，使用4个spe实现两个数据流比使用8个spe实现单个数据流更有效

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Performance Analysis of Cell Broadband Engine for High Memory Bandwidth Applications

The cell broadband engine (CBE) is designed to be a general purpose platform exposing an enormous arithmetic performance due to its eight SIMD-only synergistic processor elements (SPEs), capable of achieving 134.4 GFLOPS (16.8 GFLOPS * 8) at 2.1 GHz, and a 64-bit power processor element (PPE). Each SPE has a 256Kb non-coherent local memory, and communicates to other SPEs and main memory through its DMA controller. CBE main memory is connected to all the CBE processor elements (PPE and SPEs) through the element interconnect bus (EIB), which has a 134.4 GB/s bandwidth performance peak at half the processor speed. Therefore, CBE platform is suitable to be used by applications using MPI and streaming programming models with a potential high performance peak. In this paper we focus on the communication part of those applications, and measure the actual memory bandwidth that each of the CBE processor components can sustain. We have measured the sustained bandwidth between PPE and memory, SPE and memory, two individual SPEs to determine if this bandwidth depends on their physical location, pairs of SPEs to achieve maximum bandwidth in nearly-ideal conditions, and in a cycle of SPEs representing a streaming kind of computation. Our results on a real machine show that following some strict programming rules, individual SPE to SPE communication almost achieves the peak bandwidth when using the DMA controllers to transfer memory chunks of at least 1024 Bytes. In addition, SPE to memory bandwidth should be considered in streaming programming. For instance, implementing two data streams using 4 SPEs each can be more efficient than having a single data stream using the 8 SPEs

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2007 IEEE International Symposium on Performance Analysis of Systems & Software

自引率

0.00%

发文量