Work-in-Progress: NoRF: A Case Against Register File Operands in Tightly-Coupled Accelerators

2022 International Conference on Compilers, Architecture, and Synthesis for Embedded Systems (CASES) Pub Date: 2022-10-01 DOI: 10.1109/CASES55004.2022.00028

David J. Schlais, Heng Zhuo, Mikko H. Lipasti


Citations: 0

Abstract

Accelerators are often used to increase the performance and/or energy efficiency of general-purpose CPUs. However, Tightly-Coupled Accelerators (TCAs) often perform computations on data structures that may not be a natural fit for general-purpose registers. The designer can use the existing register file (RF), use an RF tailored for the accelerator, or eschew an RF entirely (NoRF) and access operands directly from the memory hierarchy. Designers of embedded and edge devices are particularly conscious of energy-efficient compute and data transfer. We explore the possibility of mini-DGEMM accelerators (example TCAs) within the context of CPUs and edge devices, which also have increasing applications for DGEMM compute. At a high level, register files help reduce memory accesses (steps 1, 2, 5, and 6 in Figure 1) when the compiler finds reuse of operands in the program dataflow. On the other hand, direct memory access simplifies data movement by completely eliminating the intermediate reads and writes to a register file, but it issues more memory requests. This paper evaluates the difference between these options for operand delivery. Figure 2 shows that all recent vector extensions use a register file implementation. Given this trend, it may seem natural to incorporate mini-matrices into the RF. However, we present quantitative and qualitative evidence to advocate for direct cache access for operands.
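The trade-off described in the abstract can be illustrated with a toy access-count model. This is only a sketch under stated assumptions, not the authors' evaluation methodology: the function names (`tile_ops`, `count_rf`, `count_norf`), the tiled `C += A * B` schedule, and the assumption of a register file large enough to hold every tile are all hypothetical choices made for illustration.

```python
from collections import Counter
from itertools import product


def tile_ops(n):
    """Yield (sources, destination) for C[i][j] += A[i][k] * B[k][j]
    over an n x n grid of tiles (hypothetical mini-DGEMM schedule)."""
    for i, j, k in product(range(n), repeat=3):
        c = ("C", i, j)
        yield (("A", i, k), ("B", k, j), c), c


def count_rf(ops):
    """Operands staged through a register file.
    Assumption: the RF is large enough that no tile is ever evicted."""
    counts = Counter()
    resident = set()  # tiles currently held in the RF
    dirty = set()     # result tiles that still need a write-back
    for sources, dest in ops:
        for tile in sources:
            if tile not in resident:
                counts["mem_reads"] += 1  # load from the memory hierarchy
                counts["rf_writes"] += 1  # fill the register
                resident.add(tile)
            counts["rf_reads"] += 1       # accelerator reads its operand
        counts["rf_writes"] += 1          # accelerator writes its result
        resident.add(dest)
        dirty.add(dest)
    for _ in dirty:
        counts["rf_reads"] += 1           # the store reads the register
        counts["mem_writes"] += 1         # and writes it back to memory
    return counts


def count_norf(ops):
    """NoRF: every operand comes from, and every result goes to, memory."""
    counts = Counter()
    for sources, _dest in ops:
        counts["mem_reads"] += len(sources)
        counts["mem_writes"] += 1
    return counts


if __name__ == "__main__":
    print("RF:  ", dict(count_rf(tile_ops(2))))
    print("NoRF:", dict(count_norf(tile_ops(2))))
```

For a 2 x 2 grid of tiles, the model counts 12 memory reads and 4 memory writes for the RF scheme, plus 28 RF reads and 20 RF writes, versus 24 memory reads and 8 memory writes with no RF traffic for NoRF. These numbers come from the toy model alone; they say nothing about the relative energy cost of each access type or about the quantitative results reported in the paper.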
