Application performance characterization and analysis on Blue Gene/Q
B. Walkup
2012 SC Companion: High Performance Computing, Networking Storage and Analysis, pp. 2247-2280, November 10, 2012. DOI: 10.1109/SC.Companion.2012.358
This article consists of a collection of slides from the author's conference presentation. The author concludes that the Blue Gene/Q design, with low-power simple cores and four hardware threads per core, results in high instruction throughput and thus exceptional power efficiency for applications. The hardware threads can effectively fill pipeline stalls and hide latencies in the memory subsystem. The consequence is low performance per thread, so a high degree of parallelization is required for high application performance. Traditional programming methods (MPI, OpenMP, Pthreads) hold up at very large scales. Memory costs can limit scaling when data structures grow linearly with the number of processes; threading helps by keeping the number of processes manageable. Detailed performance analysis is viable at more than 10^6 processes but requires care. On-the-fly performance data reduction has merits.
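The "on-the-fly performance data reduction" point can be illustrated with a minimal sketch. Plain Python stands in for MPI here (the slides do not give an implementation, and the variable names and job size below are hypothetical); the reduction mirrors what `MPI_Reduce` with the predefined `MPI_MIN`, `MPI_MAX`, and `MPI_SUM` operations would compute across ranks, so only a handful of scalars are collected regardless of process count:

```python
# Sketch: on-the-fly reduction of per-process timing data.
# At > 10^6 MPI ranks, gathering every rank's full profile is impractical;
# reducing timings to summary statistics during the run keeps the collected
# data size constant. Plain Python stands in for MPI collectives here.
import random

random.seed(42)
nprocs = 1_000_000  # hypothetical job size, for illustration only

# Simulated per-rank elapsed times for one code region (seconds).
times = [random.uniform(9.5, 10.5) for _ in range(nprocs)]

# The "reduction": each rank contributes one value; three scalars survive.
t_min = min(times)            # what MPI_Reduce(..., MPI_MIN, ...) would yield
t_max = max(times)            # what MPI_Reduce(..., MPI_MAX, ...) would yield
t_avg = sum(times) / nprocs   # MPI_Reduce(..., MPI_SUM, ...) divided by nprocs

print(f"min={t_min:.3f}s  max={t_max:.3f}s  avg={t_avg:.3f}s")
# A max/avg ratio near 1.0 indicates good load balance across ranks.
print(f"imbalance (max/avg) = {t_max / t_avg:.3f}")
```

The same idea extends to hardware-counter data: reducing per-rank values to min/max/sum summaries in place avoids funneling O(nprocs) records to a single collection point.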