Sharing the instruction cache among lean cores on an asymmetric CMP for HPC applications

Ugljesa Milic, Alejandro Rico, P. Carpenter, Alex Ramírez
{"title":"Sharing the instruction cache among lean cores on an asymmetric CMP for HPC applications","authors":"Ugljesa Milic, Alejandro Rico, P. Carpenter, Alex Ramírez","doi":"10.1109/ISPASS.2017.7975265","DOIUrl":null,"url":null,"abstract":"High performance computing (HPC) applications have parallel code sections that must scale to large numbers of cores, which makes them sensitive to serial regions. Current supercomputing systems with heterogeneous or asymmetric CMPs (ACMP) combine few high-performance big cores for serial regions, together with many low-power lean cores for throughput computing. The low requirements of HPC applications in the core front-end lead some designs, such as SMT and GPU cores, to share front-end structures including the instruction cache (I-cache). However, little work exists to analyze the benefit of sharing the I-cache among full cores, which seems compelling as a solution to reduce silicon area and power. This paper analyzes the performance, power and area impact of such a design on an ACMP with one high-performance core and multiple low-power cores. Having identified that multiple cores run the same code during parallel regions, the lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the multiple parameters finds the sweet spot on a wide interconnect to access the shared I-cache and the inclusion of a few line buffers to provide the required bandwidth and latency to sustain performance. The projections with McPAT and a rich set of HPC benchmarks show 11% area savings with a 5% energy reduction at no performance cost.","PeriodicalId":123307,"journal":{"name":"2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","volume":"99 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-04-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ISPASS.2017.7975265","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 3

Abstract

High performance computing (HPC) applications have parallel code sections that must scale to large numbers of cores, which makes them sensitive to serial regions. Current supercomputing systems with heterogeneous or asymmetric CMPs (ACMPs) combine a few high-performance big cores for serial regions with many low-power lean cores for throughput computing. The low front-end requirements of HPC applications lead some designs, such as SMT and GPU cores, to share front-end structures, including the instruction cache (I-cache). However, little work exists analyzing the benefit of sharing the I-cache among full cores, which seems compelling as a way to reduce silicon area and power. This paper analyzes the performance, power, and area impact of such a design on an ACMP with one high-performance core and multiple low-power cores. Having identified that multiple cores run the same code during parallel regions, the lean cores share the I-cache with the intent of benefiting from mutual prefetching, without increasing the average access latency. Our exploration of the design parameters finds the sweet spot: a wide interconnect to access the shared I-cache, plus a few line buffers that provide the bandwidth and latency required to sustain performance. Projections with McPAT and a rich set of HPC benchmarks show 11% area savings and a 5% energy reduction at no performance cost.
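The key mechanism in the abstract is that lean cores executing the same parallel-region code mutually prefetch for one another through the shared I-cache: the first core to miss on a line brings it in, and the remaining cores then hit. The following is a minimal sketch of that effect, not the authors' simulator. All names (SharedICache, LeanCore), sizes, latencies, the LRU policy, the placement of the line buffers per core, and the fetch trace are illustrative assumptions, not values or structures taken from the paper.

```python
"""Toy model of lean cores sharing one I-cache (illustrative assumptions only)."""
from collections import OrderedDict

LINE_SIZE = 64          # bytes per cache line (assumption)
ICACHE_LINES = 512      # shared I-cache capacity in lines (assumption)
NUM_LINE_BUFFERS = 4    # tiny per-core line buffers (assumed placement)

class SharedICache:
    """One I-cache shared by all lean cores; LRU replacement over line tags."""
    def __init__(self):
        self.lines = OrderedDict()   # tag -> True, ordered by recency
        self.hits = self.misses = 0

    def access(self, addr):
        tag = addr // LINE_SIZE
        if tag in self.lines:
            self.lines.move_to_end(tag)     # refresh LRU position
            self.hits += 1
            return True
        self.misses += 1
        self.lines[tag] = True
        if len(self.lines) > ICACHE_LINES:
            self.lines.popitem(last=False)  # evict least-recently-used line
        return False

class LeanCore:
    """A lean core whose few line buffers filter traffic to the shared cache."""
    def __init__(self, name, shared):
        self.name, self.shared = name, shared
        self.buffers = OrderedDict()  # tag -> True, tiny fully-assoc. buffer

    def fetch(self, pc):
        tag = pc // LINE_SIZE
        if tag in self.buffers:       # line-buffer hit: no shared-cache access
            self.buffers.move_to_end(tag)
            return
        self.shared.access(pc)        # otherwise go to the shared I-cache
        self.buffers[tag] = True
        if len(self.buffers) > NUM_LINE_BUFFERS:
            self.buffers.popitem(last=False)

# Demo: four lean cores fetching the same 4 KiB parallel-region loop body.
# The first core to touch each line takes the cold miss; the other three
# then hit, i.e. they are "mutually prefetched" by that core's miss.
shared = SharedICache()
cores = [LeanCore(f"core{i}", shared) for i in range(4)]
for pc in range(0x1000, 0x2000, 4):   # one instruction every 4 bytes
    for core in cores:
        core.fetch(pc)

total = shared.hits + shared.misses
print(f"shared I-cache: {shared.hits}/{total} hits "
      f"({shared.misses} cold misses, one per 64 B line)")
```

Running the sketch reports 192/256 hits with 64 cold misses: one miss per 64-byte line regardless of how many cores execute the region, which is the mutual-prefetching benefit the abstract describes. The paper's actual design additionally sizes a wide interconnect and line buffers so this sharing does not increase average access latency; those aspects are not modeled here.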