Plasticine: A Cross-layer Approximation Methodology for Multi-kernel Applications through Minimally Biased, High-throughput, and Energy-efficient SIMD Soft Multiplier-divider

ACM Transactions on Design Automation of Electronic Systems (TODAES), published 2021-11-02, DOI: 10.1145/3486616

Zahra Ebrahimi, D. Klar, Mohammad Aasim Ekhtiyar, Akash Kumar


Citations: 1

Abstract

The rapid evolution of error-resilient programs, intertwined with their quest for high throughput, has motivated the use of Single Instruction, Multiple Data (SIMD) components in Field-Programmable Gate Arrays (FPGAs). In particular, to exploit the error resilience of such applications, the cross-layer approximation paradigm has recently gained traction; its ultimate goal is to efficiently exploit approximation potential across layers of abstraction. From the circuit level to the application level, valuable studies have proposed various approximation techniques, albeit with four drawbacks. First, most approximate multipliers and dividers operate only in Single Instruction, Single Data (SISD) mode. Second, imprecise units are often substituted in only a single kernel of a multi-kernel application, and the end-to-end analysis covers Quality of Results (QoR) but not the performance gained. Third, state-of-the-art (SoA) strategies neglect the fact that each kernel contributes differently to the end-to-end QoR and performance metrics; consequently, they lack a generic methodology for adjusting the approximation knobs to maximize performance gains under a user-defined quality constraint. Finally, multi-level techniques are not efficiently supported in a cohesive cross-layer hierarchy spanning the application, architecture, and circuit levels. In this article, we propose Plasticine, a cross-layer methodology for multi-kernel applications that addresses these challenges by exploiting the synergistic effects of a chain of techniques across layers of abstraction. To this end, we propose an application sensitivity analysis and a heuristic that tailor the precision of each constituent kernel by finding the most tolerable degree of approximation for each of the consecutive kernels, while still satisfying the user-defined end-to-end QoR.
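The abstract does not spell out the heuristic itself. As a rough illustration only, the Python sketch below shows one generic way a sensitivity-ordered, greedy precision search could work. It assumes a hypothetical `qor` callback that scores a per-kernel bit-width assignment, hypothetical lane widths {32, 16, 8}, and QoR that only degrades as precision drops; `tune_precisions`, `toy_qor`, and the kernel names are illustrative inventions, not the Plasticine algorithm.

```python
from typing import Callable, Dict, List

# Hypothetical lane widths (bits) a precision-reconfigurable unit might offer,
# ordered from most to least precise.
PRECISIONS: List[int] = [32, 16, 8]


def tune_precisions(
    kernels: List[str],
    qor: Callable[[Dict[str, int]], float],
    qor_min: float,
) -> Dict[str, int]:
    """Greedily lower per-kernel precision while end-to-end QoR stays >= qor_min."""
    full = {k: PRECISIONS[0] for k in kernels}
    baseline = qor(full)
    # Sensitivity analysis: QoR loss when only kernel k runs at the lowest precision.
    sens = {k: baseline - qor({**full, k: PRECISIONS[-1]}) for k in kernels}

    config = dict(full)
    # Approximate the least sensitive kernels first.
    for k in sorted(kernels, key=lambda name: sens[name]):
        for bits in PRECISIONS[1:]:
            trial = {**config, k: bits}
            if qor(trial) >= qor_min:
                config = trial  # accept the cheaper precision for this kernel
            else:
                break  # assumes QoR only degrades further at lower precision
    return config


# Toy QoR model (hypothetical): each kernel adds weight * error(bits) to the loss.
ERROR = {32: 0.0, 16: 0.01, 8: 0.1}
WEIGHT = {"filter": 0.2, "fft": 1.0, "classify": 0.5}


def toy_qor(config: Dict[str, int]) -> float:
    return 1.0 - sum(WEIGHT[k] * ERROR[bits] for k, bits in config.items())


if __name__ == "__main__":
    print(tune_precisions(["filter", "fft", "classify"], toy_qor, qor_min=0.95))
    # {'filter': 8, 'fft': 16, 'classify': 16}
```

Ordering kernels by their individual sensitivity is one simple way to spend the QoR budget on the kernels that tolerate approximation best; a fuller search would also weigh each kernel's performance contribution, which, as noted above, differs from kernel to kernel.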
The chain of approximations is enabled across the cross-layer hierarchy, from the application to the architecture to the circuit level, through the plasticity of SIMD multiplier-dividers, each of which supports dynamic precision variability along with hybrid functionality. End-to-end evaluations of Plasticine on three multi-kernel applications, in bio-signal processing, image processing, and moving-object tracking for Unmanned Air Vehicles (UAVs), demonstrate improvements of 41%–64% in area, 39%–62% in latency, and 70%–86% in Area-Delay-Product (ADP) over 32-bit fixed precision, with negligible loss in QoR. To springboard future research in the reconfigurable and approximate computing communities, our implementations will be open-sourced at https://cfaed.tu-dresden.de/pd-downloads.
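The abstract attributes the cross-layer chain to SIMD multiplier-dividers with dynamic precision and hybrid functionality but gives no circuit details. The Python model below only illustrates the general idea of running several narrower operations inside one 32-bit word at the functional level. It assumes unsigned integer lanes, exact (not approximate) arithmetic, and hypothetical 8-, 16-, and 32-bit lane modes; `pack`, `unpack`, and `simd_muldiv` are illustrative names, not part of the released implementation.

```python
from typing import List


def pack(values: List[int], lane_bits: int) -> int:
    """Pack unsigned lane values into one word, lane 0 in the least significant bits."""
    word = 0
    for i, v in enumerate(values):
        if not 0 <= v < (1 << lane_bits):
            raise ValueError(f"value {v} does not fit in {lane_bits} bits")
        word |= v << (i * lane_bits)
    return word


def unpack(word: int, lane_bits: int, lanes: int) -> List[int]:
    """Extract `lanes` unsigned fields of `lane_bits` bits each from a packed word."""
    mask = (1 << lane_bits) - 1
    return [(word >> (i * lane_bits)) & mask for i in range(lanes)]


def simd_muldiv(a: int, b: int, lane_bits: int, op: str, word_bits: int = 32) -> List[int]:
    """Exact lane-wise multiply or divide; lane_bits selects the SIMD precision mode."""
    if word_bits % lane_bits:
        raise ValueError("lane width must divide the word width")
    lanes = word_bits // lane_bits
    xs, ys = unpack(a, lane_bits, lanes), unpack(b, lane_bits, lanes)
    if op == "mul":
        return [x * y for x, y in zip(xs, ys)]  # full 2*lane_bits-wide products
    if op == "div":
        return [x // y for x, y in zip(xs, ys)]  # integer quotient; raises on y == 0
    raise ValueError(f"unsupported op: {op}")


# 4 x 8-bit lanes versus 2 x 16-bit lanes on the same 32-bit word.
a8, b8 = pack([3, 10, 200, 7], 8), pack([5, 2, 2, 7], 8)
print(simd_muldiv(a8, b8, 8, "mul"))  # [15, 20, 400, 49]
print(simd_muldiv(a8, b8, 8, "div"))  # [0, 5, 100, 1]
a16, b16 = pack([1000, 300], 16), pack([3, 4], 16)
print(simd_muldiv(a16, b16, 16, "mul"))  # [3000, 1200]
```

Narrowing the lanes from 16 to 8 bits doubles the number of operations per word at the cost of range and precision, which is the kind of throughput-versus-quality knob a per-kernel precision assignment would select.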

