Dynamically Specialized Datapaths for energy efficient computing

2011 IEEE 17th International Symposium on High Performance Computer Architecture. Pub Date: 2011-02-12. DOI: 10.1109/HPCA.2011.5749755

Venkatraman Govindaraju, C. Ho, K. Sankaralingam


Citations: 218

Abstract

Due to limits in technology scaling, the energy efficiency of logic devices is decreasing in successive generations. To provide continued performance improvements without increasing power, regardless of the sequential or parallel nature of the application, microarchitectural energy efficiency must improve. We propose Dynamically Specialized Datapaths to improve the energy efficiency of general-purpose programmable processors. The key insights of this work are the following. First, applications execute in phases, and these phases can be determined by creating a path-tree of basic blocks rooted at the innermost loop. Second, specialized datapaths corresponding to these path-trees, which we refer to as DySER blocks, can be constructed by interconnecting a set of heterogeneous computation units with a circuit-switched network. These blocks can be easily integrated with a processor pipeline. A synthesized RTL implementation using an industry 55nm technology library shows that a 64-functional-unit DySER block occupies approximately the same area as a 64 KB single-ported SRAM and can execute at 2 GHz. We extend the GCC compiler to identify path-trees and map code to DySER, and we evaluate the PARSEC, SPEC, and Parboil benchmark suites. Our results show that in most cases two DySER blocks can achieve the same performance (within 5%) as having a specialized hardware module for each path-tree. A 64-FU DySER block can cover 12% to 100% of the dynamically executed instruction stream. When integrated with a dual-issue out-of-order processor, two DySER blocks provide a geometric mean speedup of 2.1X (1.15X to 10X), a geometric mean energy reduction of 40% (up to 70%), and a 60% energy reduction if no performance improvement is required.
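The speedup and energy figures above are reported as geometric means across benchmarks. As a minimal sketch of how such a summary is computed, the Python snippet below takes the geometric mean of a list of per-benchmark speedups. The helper name `geometric_mean` and the values in `speedups` are hypothetical and chosen only for illustration; they are not data from the paper.

```python
import math


def geometric_mean(values):
    """Return the geometric mean of positive numbers, computed as exp(mean(log(v)))."""
    if not values:
        raise ValueError("need at least one value")
    if any(v <= 0 for v in values):
        raise ValueError("geometric mean requires strictly positive values")
    return math.exp(sum(math.log(v) for v in values) / len(values))


# Hypothetical per-benchmark speedups (illustrative only, NOT results from the paper).
speedups = [1.15, 2.0, 4.0, 10.0]

gm = geometric_mean(speedups)
am = sum(speedups) / len(speedups)

print(f"geometric mean speedup:  {gm:.2f}X")
print(f"arithmetic mean speedup: {am:.2f}X")
```

The geometric mean is the usual choice for summarizing ratios such as speedups because a single large outlier (here the 10X entry) pulls it up less than it pulls up the arithmetic mean.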

