A certain examination on heterogeneous systolic array (HSA) design for deep learning accelerations with low power computations

Impact factor 3.8; Zone 3 (Computer Science); JCR Q1 (Computer Science, Hardware & Architecture)
Dinesh Kumar Jayaraman Rajanediran, C. Ganesh Babu, K. Priyadharsini
DOI: 10.1016/j.suscom.2024.101042
Journal: Sustainable Computing-Informatics & Systems, vol. 44, Article 101042
Published: 2024-10-11
URL: https://www.sciencedirect.com/science/article/pii/S2210537924000878
Citations: 0

Abstract

Acceleration techniques play a crucial role in modern high-speed computation, especially in Deep Learning (DL) applications, where speed is of utmost importance. One essential component in this context is the Systolic Array (SA), which handles computational tasks and data processing in a rhythmic manner. Google's Tensor Processing Unit (TPU) uses SAs for neural networks. The SA's core functionality and performance lie in the Computation Element (CE), which facilitates parallel data flow. In this article, we introduce a novel approach called the Proposed Systolic Array (PSA), which is implemented on the CE and further enhanced with a modified Hybrid Kogge-Stone adder (MHA). The resulting design, PSA-MHA, incorporates principles that expedite computation by rounding and extracting data in the SA data model. Using a data flow model with the MHA, the PSA significantly accelerates data shifts and control passes across execution cycles. We validated our approach through simulations on the Cadence Virtuoso platform using 65 nm process technology, comparing it to the General Matrix Multiplication (GMMN) benchmark. The CE showed a 30.29 % reduction in delay, a 23.07 % reduction in area, and an 11.87 % reduction in power consumption. The PSA itself achieved a 46.38 % reduction in delay, a 7.58 % reduction in area, and a 48.23 % decrease in Area Delay Product (ADP). To further substantiate our findings, we applied the PSA-based approach to pre-trained hybrid Convolutional and Recurrent (CNN-RNN) neural models. The PSA-based hybrid model incorporates 189 million Multiply-Accumulate (MAC) units, resulting in a weighted mean architecture value of 784.80 for the RNN component. We also explored variations in bit width, which led to delay reductions ranging from 20.17 % to 30.29 %, area variations between 13.08 % and 32.16 %, and power consumption changes ranging from 11.88 % to 20.42 %.
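The abstract does not give the PSA dataflow or the internals of the MHA, so the following is only a minimal sketch of the two general ideas it builds on: a grid of computation elements that each perform a multiply-accumulate while operands shift one hop per cycle, and a parallel-prefix (Kogge-Stone) adder for the accumulation. Everything here is an assumption made for illustration: the function names `kogge_stone_add` and `systolic_matmul`, the output-stationary dataflow, the unsigned fixed-width arithmetic, and the use of a textbook Kogge-Stone network rather than the authors' modified hybrid variant. The multiplier is not modeled and no rounding is applied.

```python
def kogge_stone_add(a: int, b: int, width: int = 16) -> int:
    """Unsigned addition modulo 2**width using a textbook Kogge-Stone prefix network."""
    mask = (1 << width) - 1
    a &= mask
    b &= mask
    g = a & b            # per-bit generate
    p = a ^ b            # per-bit propagate
    p0 = p               # original propagate bits, needed for the final sum
    d = 1
    while d < width:     # log2(width) prefix stages, span doubles each stage
        g = (g | (p & (g << d))) & mask   # combine with the group d bits below
        p = (p & (p << d)) & mask
        d <<= 1
    carries = (g << 1) & mask             # carry into bit i comes from bits 0..i-1
    return (p0 ^ carries) & mask


def systolic_matmul(A, B, width: int = 32):
    """Cycle-by-cycle model of an output-stationary systolic array computing A x B.

    Rows of A enter from the left and columns of B enter from the top, each
    skewed by one cycle per row/column. Inputs are assumed to be non-negative
    integers whose dot products fit in `width` bits.
    """
    m, k, n = len(A), len(B), len(B[0])
    acc = [[0] * n for _ in range(m)]     # one accumulator per computation element
    a_reg = [[0] * n for _ in range(m)]   # operand registers moving right
    b_reg = [[0] * n for _ in range(m)]   # operand registers moving down
    for t in range(m + n + k - 2):        # cycles until the last operands meet
        # Visit elements from the far corner backwards so every read sees the
        # neighbour's value from the previous cycle.
        for i in reversed(range(m)):
            for j in reversed(range(n)):
                if j > 0:
                    a_in = a_reg[i][j - 1]
                else:
                    a_in = A[i][t - i] if 0 <= t - i < k else 0
                if i > 0:
                    b_in = b_reg[i - 1][j]
                else:
                    b_in = B[t - j][j] if 0 <= t - j < k else 0
                a_reg[i][j], b_reg[i][j] = a_in, b_in
                # Multiply-accumulate step of the computation element.
                acc[i][j] = kogge_stone_add(acc[i][j], a_in * b_in, width)
    return acc


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

In this model the adder sits on the accumulate path of every element, which is why shortening its carry chain is a plausible place to look for delay savings; the figures reported in the abstract come from the authors' circuit-level simulations and cannot be reproduced by this software sketch.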
Source journal: Sustainable Computing-Informatics & Systems
Categories: COMPUTER SCIENCE, HARDWARE & ARCHITECTURE; COMPUTER SCIENCE, INFORMATION SYSTEMS
CiteScore: 10.70
Self-citation rate: 4.40%
Articles published: 142
Journal description: Sustainable computing is a rapidly expanding research area spanning computer science and engineering, electrical engineering, and other engineering disciplines. The aim of Sustainable Computing: Informatics and Systems (SUSCOM) is to publish research findings related to energy-aware and thermal-aware management of computing resources. Equally important is a spectrum of related research issues, such as applications of computing that can have ecological and societal impacts. SUSCOM publishes original and timely research papers and survey articles on power, energy, temperature, and environment-related topics of current importance to readers. SUSCOM has an editorial board comprising prominent researchers from around the world and selects competitively evaluated, peer-reviewed papers.