A Novel Parallel Processing Element Architecture for Accelerating ODE and AI

Impact factor 6.6; Zone 1 (Computer Science); JCR Q1 (Multidisciplinary)
Kaiyuan Yang;Longchao Liu;Haotian Liu;Tiantai Deng
DOI: 10.26599/TST.2024.9010090
Journal: Tsinghua Science and Technology, vol. 30, no. 5, pp. 1954-1964
Publication date: 2025-04-29
Publication type: Journal Article
Full text: https://ieeexplore.ieee.org/document/10979797/
PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10979797
Citations: 0

Abstract

Recasting complex problems as simpler computational tasks, for example by transforming ordinary differential equations (ODEs) into matrix form, is key to AI advancement and paves the way for more efficient computing architectures. Systolic arrays, known for their computational efficiency, low power consumption, and ease of implementation, address AI's computational challenges and are central to mainstream industry AI accelerators. Improving the Processing Element (PE) significantly boosts systolic array performance and also streamlines the computing architecture. This research presents a novel PE design, and its integration into a systolic array, based on a novel computing theory: bit-level mathematics for the Multiply-Accumulate (MAC) operation. We present three different architectures for the PE and compare them comprehensively with state-of-the-art technologies, focusing on power, area, and throughput. We also demonstrate the integration of the proposed MAC unit with systolic arrays, highlighting significant improvements in computational efficiency. Compared with the state-of-the-art design, our implementations achieve 2380952.38 times lower latency while using 64.19 times fewer DSP48E1 slices, 1.26 times fewer Look-Up Tables (LUTs), and 10.76 times fewer Flip-Flops (FFs), with 99.63 times lower power consumption and 15.19 times higher performance per PE.
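To make the terms in the abstract concrete, the following is a minimal software sketch of a MAC operation computed at the bit level and of PEs arranged as a systolic array. It is not the authors' design: the abstract does not describe their bit-level method, so a generic shift-and-add multiplier stands in for it, and the function names (`bit_serial_mac`, `systolic_matmul`), the unsigned 8-bit operand width, and the output-stationary dataflow are all assumptions chosen for illustration.

```python
def bit_serial_mac(acc, a, b, width=8):
    """Return acc + a * b for unsigned `width`-bit operands.

    Generic shift-and-add multiplication (an assumption, not the paper's
    method): for every set bit i of b, add a shifted left by i.
    """
    limit = 1 << width
    if not (0 <= a < limit and 0 <= b < limit):
        raise ValueError("operands must fit in `width` unsigned bits")
    for i in range(width):
        if (b >> i) & 1:
            acc += a << i
    return acc


def systolic_matmul(A, B, width=8):
    """Multiply A (n x k) by B (k x m) on a simulated systolic array.

    Hypothetical output-stationary dataflow: PE (i, j) keeps C[i][j] and,
    because inputs are skewed by one cycle per row and per column, it
    receives the operand pair A[i][s], B[s][j] at cycle t = s + i + j.
    """
    n, k, m = len(A), len(B), len(B[0])
    C = [[0] * m for _ in range(n)]
    total_cycles = k + n + m - 2  # last PE (n-1, m-1) finishes at this cycle
    for t in range(total_cycles):
        for i in range(n):
            for j in range(m):
                s = t - i - j  # index of the operand pair reaching PE (i, j)
                if 0 <= s < k:
                    C[i][j] = bit_serial_mac(C[i][j], A[i][s], B[s][j], width)
    return C


if __name__ == "__main__":
    print(systolic_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]]))
```

The link to ODEs is that a linear system x' = Ax discretized with forward Euler and step size h becomes x_{k+1} = (I + hA) x_k, so each time step is a matrix-vector product, which is exactly the kind of workload such an array of MAC units executes.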
Source journal
Tsinghua Science and Technology
Categories: Computer Science, Information Systems; Computer Science, Software Engineering
CiteScore: 10.20
Self-citation rate: 10.60%
Articles published: 2340
About the journal: Tsinghua Science and Technology (Tsinghua Sci Technol) started publication in 1996. It is an international academic journal sponsored by Tsinghua University and is published bimonthly. The journal aims to present up-to-date scientific achievements in computer science, electronic engineering, and other IT fields. Contributions from all over the world are welcome.