A Performance Counter Based Workload Characterization on Blue Gene/P

2008 37th International Conference on Parallel Processing. Pub Date: 2008-09-09. DOI: 10.1109/ICPP.2008.57

K. Ganesan, L. John, V. Salapura, J. Sexton


Citations: 25

Abstract

IBM's Blue Gene/P, the second generation of the Blue Gene supercomputer, is designed with a Universal Performance Counter (UPC) unit at each node capable of monitoring 256 events concurrently, unlike many microprocessors that provide only a few performance counters. In this paper we demonstrate the efficacy of an interface library that we have developed on top of the UPC unit. The library enables users to instrument applications with little effort and gain detailed insight into their execution on the Blue Gene/P system, which can scale to thousands of nodes. It allows the user to monitor about 512 performance-related events out of a total of 1024 possible events, aggregate the data collected at different nodes, and compute meaningful metrics through data mining. Using the developed interface, we instrumented the NAS parallel benchmarks and collected the performance counter data. We studied the MFLOPS, the L3-DDR traffic, and the dynamic instruction mix, based on the counters in the FPU and the cache hierarchy, for different compiler optimizations, modes of operation of the system, and different L3 and L2 configurations. Our analysis finds that compiler optimization O5 together with "-qarch440d", which uses architectural information about the chip during optimization, is very effective at generating SIMD instructions and results in the most efficient execution of the benchmarks. The experiments on L3 size indicate that an L3 size of 4MB is optimal for the NAS benchmarks, which do not benefit from increasing it further. Also, the virtual node mode of operation of the Blue Gene/P system is very effective and yields superior performance for the selected benchmarks by taking advantage of the chip multiprocessor architecture of the quad-core HPC chip.
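The abstract describes aggregating counter data collected at different nodes and deriving metrics such as MFLOPS, L3-DDR traffic, and instruction mix. The Python sketch below illustrates that kind of aggregation in the simplest possible form. It is not the authors' interface library: the record type `NodeCounters`, its field names, and the function `summarize` are hypothetical, and the sketch assumes each node has already reduced its raw UPC event counts to a few totals.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class NodeCounters:
    """Per-node totals (hypothetical fields, not real UPC event names)."""
    flops: int            # floating-point operations counted on this node
    simd_fp_instr: int    # SIMD floating-point instructions
    scalar_fp_instr: int  # scalar floating-point instructions
    l3_ddr_bytes: int     # bytes moved between L3 and DDR
    seconds: float        # wall-clock time of the instrumented region


def summarize(nodes: List[NodeCounters]) -> Dict[str, float]:
    """Aggregate per-node totals into job-level metrics.

    Assumption: the job's elapsed time is that of the slowest node,
    so rates are computed against max(seconds).
    """
    if not nodes:
        raise ValueError("need at least one node record")
    elapsed = max(n.seconds for n in nodes)
    if elapsed <= 0:
        raise ValueError("elapsed time must be positive")

    total_flops = sum(n.flops for n in nodes)
    total_bytes = sum(n.l3_ddr_bytes for n in nodes)
    simd = sum(n.simd_fp_instr for n in nodes)
    scalar = sum(n.scalar_fp_instr for n in nodes)
    fp_instr = simd + scalar

    return {
        "mflops": total_flops / elapsed / 1e6,
        "l3_ddr_mb_per_s": total_bytes / elapsed / 1e6,
        # Share of floating-point instructions that are SIMD.
        "simd_fraction": simd / fp_instr if fp_instr else 0.0,
    }


if __name__ == "__main__":
    job = [
        NodeCounters(2_000_000, 30, 10, 1_000_000, 1.0),
        NodeCounters(4_000_000, 50, 10, 3_000_000, 2.0),
    ]
    for name, value in summarize(job).items():
        print(f"{name}: {value:.3f}")
```

Comparing `simd_fraction` across builds is one way a compiler-flag study like the one in the abstract could be summarized. The choice of the slowest node as the time base is an assumption of this sketch, not something the abstract states.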



