流水线ALU在FPGA中有效的外部存储器访问

Tomás Benes, Michal Kekely, Karel Hynek, T. Čejka
{"title":"流水线ALU在FPGA中有效的外部存储器访问","authors":"Tomás Benes, Michal Kekely, Karel Hynek, T. Čejka","doi":"10.1109/DSD51259.2020.00026","DOIUrl":null,"url":null,"abstract":"The external memories in digital design are closely related to high response time. The most common approach to mitigate latency is adding a caching mechanism into the memory subsystem. This solution might be sufficient in CPU architecture, where we can reschedule operations when a cache miss occurs. However, the FPGA architectures are usually accelerators with simple functionality, where it is not possible to postpone work. The cache miss often leads to whole pipeline stall or even to data loss. The architecture we present in this paper reduces this problem by aggregating arithmetic operations into the memory subsystem itself. Fast data processing is achieved because arithmetic operations working with external data are offloaded. Our architecture reaches a speed of 200 Mp/s (operations carried out). It is designed to be used in systems with link speeds of 100 Gb/s. It outperforms other implementations by a factor of at least 3. The additional benefit of our architecture is reducing the number of memory transactions by a factor of two on real-world datasets.","PeriodicalId":128527,"journal":{"name":"2020 23rd Euromicro Conference on Digital System Design (DSD)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Pipelined ALU for effective external memory access in FPGA\",\"authors\":\"Tomás Benes, Michal Kekely, Karel Hynek, T. Čejka\",\"doi\":\"10.1109/DSD51259.2020.00026\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The external memories in digital design are closely related to high response time. The most common approach to mitigate latency is adding a caching mechanism into the memory subsystem. This solution might be sufficient in CPU architecture, where we can reschedule operations when a cache miss occurs. However, the FPGA architectures are usually accelerators with simple functionality, where it is not possible to postpone work. The cache miss often leads to whole pipeline stall or even to data loss. The architecture we present in this paper reduces this problem by aggregating arithmetic operations into the memory subsystem itself. Fast data processing is achieved because arithmetic operations working with external data are offloaded. Our architecture reaches a speed of 200 Mp/s (operations carried out). It is designed to be used in systems with link speeds of 100 Gb/s. It outperforms other implementations by a factor of at least 3. The additional benefit of our architecture is reducing the number of memory transactions by a factor of two on real-world datasets.\",\"PeriodicalId\":128527,\"journal\":{\"name\":\"2020 23rd Euromicro Conference on Digital System Design (DSD)\",\"volume\":\"21 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 23rd Euromicro Conference on Digital System Design (DSD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DSD51259.2020.00026\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 23rd Euromicro Conference on Digital System Design (DSD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSD51259.2020.00026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

摘要

数字设计中的外存储器与高响应时间密切相关。缓解延迟的最常见方法是在内存子系统中添加缓存机制。这个解决方案在CPU体系结构中可能是足够的,我们可以在发生缓存丢失时重新安排操作。然而,FPGA架构通常是具有简单功能的加速器,因此不可能推迟工作。缓存丢失通常会导致整个管道停滞甚至数据丢失。我们在本文中提出的架构通过将算术运算聚合到内存子系统本身来减少这个问题。实现了快速数据处理,因为处理外部数据的算术运算被卸载。我们的架构达到200 Mp/s的速度(执行的操作)。它被设计用于连接速度为100gb /s的系统。它的性能至少比其他实现高出3倍。我们架构的另一个好处是在实际数据集上将内存事务的数量减少了两倍。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Pipelined ALU for effective external memory access in FPGA
The external memories in digital design are closely related to high response time. The most common approach to mitigate latency is adding a caching mechanism into the memory subsystem. This solution might be sufficient in CPU architecture, where we can reschedule operations when a cache miss occurs. However, the FPGA architectures are usually accelerators with simple functionality, where it is not possible to postpone work. The cache miss often leads to whole pipeline stall or even to data loss. The architecture we present in this paper reduces this problem by aggregating arithmetic operations into the memory subsystem itself. Fast data processing is achieved because arithmetic operations working with external data are offloaded. Our architecture reaches a speed of 200 Mp/s (operations carried out). It is designed to be used in systems with link speeds of 100 Gb/s. It outperforms other implementations by a factor of at least 3. The additional benefit of our architecture is reducing the number of memory transactions by a factor of two on real-world datasets.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信