Exploiting Private Local Memories to Reduce the Opportunity Cost of Accelerator Integration

Proceedings of the 2016 International Conference on Supercomputing Pub Date : 2016-06-01 DOI:10.1145/2925426.2926258

E. G. Cota, Paolo Mantovani, L. Carloni


Citations: 8

Abstract

We present Roca, a technique to reduce the opportunity cost of integrating non-programmable, high-throughput accelerators in general-purpose architectures. Roca exploits the insight that non-programmable accelerators are mostly made of private local memories (PLMs), which are key to the accelerators' performance and energy efficiency. Roca transparently exposes PLMs of otherwise unused accelerators to the cache substrate, thereby allowing the system to extract utility from accelerators even when they cannot directly speed up the system's workload. Roca adds low complexity to existing accelerator designs, requires minimal modifications to the cache substrate, and incurs a modest area overhead that is almost entirely due to additional tag storage. We quantify the utility of Roca by comparing the returns of investing area in either regular last-level cache banks or Roca-enabled accelerators. Through simulation of non-accelerated multiprogrammed workloads on a 16-core system, we extend a 2MB S-NUCA baseline system to show that a 6MB Roca-enabled last-level cache built upon typical accelerators (i.e. whose area is 66% memory) can, on average, realize 70% of the performance and 68% of the energy efficiency benefits of a same-area 8MB S-NUCA configuration, in addition to the potential orders-of-magnitude efficiency and performance improvements that the added accelerators provide to workloads suitable for acceleration.
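The "same-area" comparison in the abstract can be checked with a rough back-of-the-envelope calculation. The sketch below is not taken from the paper. It assumes that the 6MB Roca-enabled cache consists of the 2MB baseline plus 4MB of accelerator PLMs, that PLM and cache storage occupy the same area per MB, and that Roca's tag overhead can be ignored. The function and constant names are hypothetical.

```python
def same_area_cache_mb(baseline_mb, plm_mb, memory_area_fraction):
    """Cache capacity (in MB) that would fit in the area of the baseline
    cache plus the accelerators.

    Assumptions (not from the paper): PLM and cache storage have equal
    area per MB, and Roca's extra tag storage is ignored.
    """
    if not 0.0 < memory_area_fraction <= 1.0:
        raise ValueError("memory_area_fraction must be in (0, 1]")
    # If memory is only a fraction of accelerator area, the whole
    # accelerator occupies plm_mb / fraction worth of "cache area".
    accelerator_area_mb = plm_mb / memory_area_fraction
    return baseline_mb + accelerator_area_mb


BASELINE_MB = 2.0            # S-NUCA baseline from the abstract
ROCA_LLC_MB = 6.0            # Roca-enabled last-level cache from the abstract
MEMORY_AREA_FRACTION = 0.66  # "typical accelerators" from the abstract

# Assumed split: everything beyond the baseline comes from exposed PLMs.
plm_mb = ROCA_LLC_MB - BASELINE_MB
equivalent_mb = same_area_cache_mb(BASELINE_MB, plm_mb, MEMORY_AREA_FRACTION)
print(f"PLM capacity: {plm_mb:.1f} MB")
print(f"Same-area regular cache: {equivalent_mb:.2f} MB")
```

Under these assumptions the result is about 8.06MB, which is consistent with the 8MB S-NUCA configuration that the abstract uses as the same-area point of comparison.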



