The Quest for Energy-Efficient I$ Design in Ultra-Low-Power Clustered Many-Cores
Igor Loi, Alessandro Capotondi, Davide Rossi, Andrea Marongiu, Luca Benini
IEEE Transactions on Multi-Scale Computing Systems, vol. 4, no. 2, pp. 99-112. Published 2 November 2017. DOI: 10.1109/TMSCS.2017.2769046
High performance and extreme energy efficiency are strong requirements for a fast-growing number of edge-node Internet of Things (IoT) applications. While traditional ultra-low-power designs rely on single-core microcontrollers (MCUs), a new generation of architectures leveraging fully programmable, tightly coupled clusters of near-threshold processors is emerging, combining the performance gain of parallel execution over multiple cores with the energy efficiency of low-voltage operation. In this work, we tackle one of the most critical energy-efficiency bottlenecks for these architectures: the instruction memory hierarchy. Exploiting the instruction locality typical of data-parallel applications, we explore two different shared instruction cache architectures built on energy-efficient latch-based memory banks: one leveraging a crossbar between processors and single-port banks (SP), and one leveraging banks with multiple read ports (MP). We evaluate the proposed architectures on a set of signal processing applications with different executable sizes and working sets. The results show that the shared cache architectures efficiently execute a much wider set of applications (including those featuring large memory footprints and irregular access patterns) with a much smaller area and much better energy efficiency than the private cache. The multi-port cache is suitable for sizes up to a few kB, improving performance by up to 40 percent, energy efficiency by up to 20 percent, and energy × area efficiency by up to 30 percent with respect to the private cache. The single-port solution is better suited to larger cache sizes (up to 16 kB), providing up to 20 percent better energy × area efficiency than the multi-port cache, and up to 30 percent better energy efficiency than the private cache.
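As a rough illustration of why multi-ported banks help at small cache sizes, the Python sketch below models per-cycle instruction-fetch conflicts in a word-interleaved, banked shared I$. This is not the authors' simulator or RTL; the bank count, port count, and access pattern are illustrative assumptions only.

# Illustrative model of fetch bank conflicts in a shared, word-interleaved I$.
# Not the paper's methodology; all parameters below are assumptions.
from collections import Counter

def fetch_cycles_single_port(pcs, num_banks=8, word_bytes=4):
    """Cycles to serve one fetch per core when each bank has a single read
    port: concurrent requests to the same bank must serialize."""
    banks = Counter((pc // word_bytes) % num_banks for pc in pcs)
    return max(banks.values()) if banks else 0

def fetch_cycles_multi_port(pcs, num_banks=8, word_bytes=4, read_ports=4):
    """Same model, but each bank exposes several read ports, so up to
    read_ports requests per bank are served in one cycle."""
    banks = Counter((pc // word_bytes) % num_banks for pc in pcs)
    return max(-(-hits // read_ports) for hits in banks.values()) if banks else 0

if __name__ == "__main__":
    # 8 cores running the same data-parallel kernel: program counters cluster
    # tightly, which is exactly the instruction locality a shared I$ exploits.
    pcs = [0x1000 + 4 * i for i in (0, 0, 1, 1, 2, 2, 3, 3)]
    print("single-port cycles:", fetch_cycles_single_port(pcs))  # 2
    print("multi-port  cycles:", fetch_cycles_multi_port(pcs))   # 1

With tightly clustered program counters, the single-port model serializes same-bank fetches while the multi-port one serves them in a single cycle, which captures the intuition behind the performance gap the abstract reports at small cache sizes; at larger sizes, the area and energy cost of extra read ports is what favors the single-port organization.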