{"title":"图形:在高速缓存中协调地进行收集和处理,实现高度并行性和灵活性","authors":"Yiming Chen;Mingyen Lee;Guohao Dai;Mufeng Zhou;Nagadastagiri Challapalle;Tianyi Wang;Yao Yu;Yongpan Liu;Yu Wang;Huazhong Yang;Vijaykrishnan Narayanan;Xueqing Li","doi":"10.1109/TETC.2023.3290683","DOIUrl":null,"url":null,"abstract":"In-memory computing (IMC) has been proposed to overcome the von Neumann bottleneck in data-intensive applications. However, existing IMC solutions could not achieve both high parallelism and high flexibility, which limits their application in more general scenarios: As a highly parallel IMC design, the functionality of a MAC crossbar is limited to the matrix-vector multiplication; Another IMC method of logic-in-memory (LiM) is more flexible in supporting different logic functions, but has low parallelism. To improve the LiM parallelism, we are inspired by investigating how the single-instruction, multiple-data (SIMD) instruction set in conventional CPU could potentially help to expand the number of LiM operands in one cycle. The biggest challenge is the inefficiency in handling non-continuous data in parallel due to the SIMD limitation of (i) continuous address, (ii) limited cache bandwidth, and (iii) large full-resolution parallel computing overheads. This article presents GRAPHIC, the first reported in-memory SIMD architecture that solves the parallelism and irregular data access challenges in applying SIMD to LiM. GRAPHIC exploits content-addressable memory (CAM) and row-wise-accessible SRAM. By providing the in-situ, full-parallelism, and low-overhead operations of address search, cache read-compute-and-update, GRAPHIC accomplishes high-efficiency gather and aggregation with high parallelism, high energy efficiency, low latency, and low area overheads. Experiments in both continuous data access and irregular data pattern applications show an average speedup of 5x over iso-area AVX-like LiM, and 3-5x over the emerging CAM-based accelerators of CAPE and GaaS-X in advanced techniques.","PeriodicalId":13156,"journal":{"name":"IEEE Transactions on Emerging Topics in Computing","volume":"12 1","pages":"84-96"},"PeriodicalIF":5.1000,"publicationDate":"2023-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"GRAPHIC: Gather and Process Harmoniously in the Cache With High Parallelism and Flexibility\",\"authors\":\"Yiming Chen;Mingyen Lee;Guohao Dai;Mufeng Zhou;Nagadastagiri Challapalle;Tianyi Wang;Yao Yu;Yongpan Liu;Yu Wang;Huazhong Yang;Vijaykrishnan Narayanan;Xueqing Li\",\"doi\":\"10.1109/TETC.2023.3290683\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In-memory computing (IMC) has been proposed to overcome the von Neumann bottleneck in data-intensive applications. However, existing IMC solutions could not achieve both high parallelism and high flexibility, which limits their application in more general scenarios: As a highly parallel IMC design, the functionality of a MAC crossbar is limited to the matrix-vector multiplication; Another IMC method of logic-in-memory (LiM) is more flexible in supporting different logic functions, but has low parallelism. To improve the LiM parallelism, we are inspired by investigating how the single-instruction, multiple-data (SIMD) instruction set in conventional CPU could potentially help to expand the number of LiM operands in one cycle. 
The biggest challenge is the inefficiency in handling non-continuous data in parallel due to the SIMD limitation of (i) continuous address, (ii) limited cache bandwidth, and (iii) large full-resolution parallel computing overheads. This article presents GRAPHIC, the first reported in-memory SIMD architecture that solves the parallelism and irregular data access challenges in applying SIMD to LiM. GRAPHIC exploits content-addressable memory (CAM) and row-wise-accessible SRAM. By providing the in-situ, full-parallelism, and low-overhead operations of address search, cache read-compute-and-update, GRAPHIC accomplishes high-efficiency gather and aggregation with high parallelism, high energy efficiency, low latency, and low area overheads. Experiments in both continuous data access and irregular data pattern applications show an average speedup of 5x over iso-area AVX-like LiM, and 3-5x over the emerging CAM-based accelerators of CAPE and GaaS-X in advanced techniques.\",\"PeriodicalId\":13156,\"journal\":{\"name\":\"IEEE Transactions on Emerging Topics in Computing\",\"volume\":\"12 1\",\"pages\":\"84-96\"},\"PeriodicalIF\":5.1000,\"publicationDate\":\"2023-07-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Emerging Topics in Computing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10184175/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Emerging Topics in Computing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10184175/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
GRAPHIC: Gather and Process Harmoniously in the Cache With High Parallelism and Flexibility
In-memory computing (IMC) has been proposed to overcome the von Neumann bottleneck in data-intensive applications. However, existing IMC solutions cannot achieve both high parallelism and high flexibility, which limits their use in more general scenarios: as a highly parallel IMC design, a MAC crossbar is limited to matrix-vector multiplication, while logic-in-memory (LiM), another IMC approach, is more flexible in supporting different logic functions but offers low parallelism. To improve LiM parallelism, we draw inspiration from how the single-instruction, multiple-data (SIMD) instruction set in conventional CPUs could expand the number of LiM operands processed in one cycle. The biggest challenge is the inefficiency of handling non-contiguous data in parallel, which stems from SIMD's limitations of (i) contiguous addressing, (ii) limited cache bandwidth, and (iii) large full-resolution parallel-computing overheads. This article presents GRAPHIC, the first reported in-memory SIMD architecture that solves the parallelism and irregular-data-access challenges of applying SIMD to LiM. GRAPHIC exploits content-addressable memory (CAM) and row-wise-accessible SRAM. By providing in-situ, fully parallel, low-overhead address search and cache read-compute-update operations, GRAPHIC accomplishes high-efficiency gather and aggregation with high parallelism, high energy efficiency, low latency, and low area overhead. Experiments on applications with both contiguous data access and irregular data patterns show an average speedup of 5x over an iso-area AVX-like LiM design, and 3-5x over the emerging CAM-based accelerators CAPE and GaaS-X built with advanced techniques.
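
To make the irregular-access bottleneck concrete, the short C sketch below (an illustration only, not code from the paper) contrasts a contiguous AVX2 load, where one wide instruction serves all lanes, with a gather over arbitrary indices such as graph neighbor lists, where each lane needs its own address and throughput is bounded by cache-port bandwidth. The array contents and index values are made up for the example.

// Illustrative sketch: contiguous SIMD load vs. the irregular gather pattern
// that the abstract identifies as the bottleneck for conventional SIMD.
// Requires AVX2; compile with, e.g., gcc -O2 -mavx2 gather_demo.c
#include <immintrin.h>
#include <stdio.h>

int main(void) {
    /* Hypothetical data: a small table and non-contiguous indices,
       e.g., graph neighbor lists or sparse-matrix column indices. */
    int table[16]  = {  0, 10, 20, 30, 40, 50, 60, 70,
                       80, 90,100,110,120,130,140,150 };
    int indices[8] = { 3, 11, 0, 7, 14, 2, 9, 5 };

    /* Contiguous access: one wide load fills all 8 lanes at once. */
    __m256i contiguous = _mm256_loadu_si256((const __m256i *)table);

    /* Irregular access: the gather issues one address per lane, so even as a
       single instruction it is limited by how many cache ports can be used. */
    __m256i vidx     = _mm256_loadu_si256((const __m256i *)indices);
    __m256i gathered = _mm256_i32gather_epi32(table, vidx, 4); /* scale = 4 bytes */

    int out[8];
    _mm256_storeu_si256((__m256i *)out, gathered);
    for (int i = 0; i < 8; i++)
        printf("%d ", out[i]);  /* prints 30 110 0 70 140 20 90 50 */
    printf("\n");
    (void)contiguous;
    return 0;
}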
Journal introduction:
IEEE Transactions on Emerging Topics in Computing publishes papers on emerging aspects of computer science, computing technology, and computing applications not currently covered by other IEEE Computer Society Transactions. Examples of emerging topics in computing include: IT for green, synthetic and organic computing structures and systems, advanced analytics, social/occupational computing, location-based/client computer systems, morphic computer design, electronic game systems, and health-care IT.