Efficient execution of memory access phases using dataflow specialization

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA) Pub Date : 2015-06-13 DOI:10.1145/2749469.2750390

C. Ho, Sung Jin Kim, K. Sankaralingam

{"title":"Efficient execution of memory access phases using dataflow specialization","authors":"C. Ho, Sung Jin Kim, K. Sankaralingam","doi":"10.1145/2749469.2750390","DOIUrl":null,"url":null,"abstract":"This paper identifies a new opportunity for improving the efficiency of a processor core: memory access phases of programs. These are dynamic regions of programs where most of the instructions are devoted to memory access or address computation. These occur naturally in programs because of workload properties, or when employing an in-core accelerator, we get induced phases where the code execution on the core is access code. We observe such code requires an OOO core's dataflow and dynamism to run fast and does not execute well on an in-order processor. However, an OOO core consumes much power, effectively increasing energy consumption and reducing the energy efficiency of in-core accelerators. We develop an execution model called memory access dataflow (MAD) that encodes dataflow computation, event-condition-action rules, and explicit actions. Using it we build a specialized engine that provides an OOO core's performance but at a fraction of the power. Such an engine can serve as a general way for any accelerator to execute its respective induced phase, thus providing a common interface and implementation for current and future accelerators. We have designed and implemented MAD in RTL, and we demonstrate its generality and flexibility by integration with four diverse accelerators (SSE, DySER, NPU, and C-Cores). Our quantitative results show, relative to in-order, 2-wide OOO, and 4-wide OOO, MAD provides 2.4×, 1.4× and equivalent performance respectively. It provides 0.8×, 0.6× and 0.4× lower energy.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"16 1","pages":"118-130"},"PeriodicalIF":0.0000,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"30","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2749469.2750390","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 30

Abstract

This paper identifies a new opportunity for improving the efficiency of a processor core: memory access phases of programs. These are dynamic regions of programs where most of the instructions are devoted to memory access or address computation. These occur naturally in programs because of workload properties, or when employing an in-core accelerator, we get induced phases where the code execution on the core is access code. We observe such code requires an OOO core's dataflow and dynamism to run fast and does not execute well on an in-order processor. However, an OOO core consumes much power, effectively increasing energy consumption and reducing the energy efficiency of in-core accelerators. We develop an execution model called memory access dataflow (MAD) that encodes dataflow computation, event-condition-action rules, and explicit actions. Using it we build a specialized engine that provides an OOO core's performance but at a fraction of the power. Such an engine can serve as a general way for any accelerator to execute its respective induced phase, thus providing a common interface and implementation for current and future accelerators. We have designed and implemented MAD in RTL, and we demonstrate its generality and flexibility by integration with four diverse accelerators (SSE, DySER, NPU, and C-Cores). Our quantitative results show, relative to in-order, 2-wide OOO, and 4-wide OOO, MAD provides 2.4×, 1.4× and equivalent performance respectively. It provides 0.8×, 0.6× and 0.4× lower energy.

查看原文本刊更多论文

使用数据流专门化有效地执行内存访问阶段

本文提出了一个提高处理器核心效率的新机会:程序的存储器访问阶段。这些是程序的动态区域，其中大部分指令用于内存访问或地址计算。由于工作负载属性，这些情况在程序中很自然地发生，或者当使用核心内加速器时，我们会得到诱导阶段，其中核心上的代码执行是访问代码。我们观察到这样的代码需要OOO核心的数据流和动态才能快速运行，并且在有序处理器上执行得不好。但是，OOO内核消耗大量的功率，有效地增加了能量消耗，降低了内核内加速器的能量效率。我们开发了一个称为内存访问数据流(MAD)的执行模型，该模型对数据流计算、事件-条件-操作规则和显式操作进行编码。使用它，我们构建了一个专门的引擎，它提供了OOO核心的性能，但功率只是它的一小部分。这样的引擎可以作为任何加速器执行其各自诱导阶段的通用方式，从而为当前和未来的加速器提供通用接口和实现。我们在RTL中设计并实现了MAD，并通过集成四种不同的加速器(SSE, DySER, NPU和C-Cores)来展示其通用性和灵活性。我们的定量结果表明，相对于有序的2宽OOO和4宽OOO, MAD分别提供2.4倍、1.4倍和等效的性能。提供0.8倍、0.6倍、0.4倍的低能量。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

自引率

0.00%

发文量