流水线ALU在FPGA中有效的外部存储器访问

2020 23rd Euromicro Conference on Digital System Design (DSD) Pub Date : 2020-08-01 DOI:10.1109/DSD51259.2020.00026

Tomás Benes, Michal Kekely, Karel Hynek, T. Čejka

{"title":"流水线ALU在FPGA中有效的外部存储器访问","authors":"Tomás Benes, Michal Kekely, Karel Hynek, T. Čejka","doi":"10.1109/DSD51259.2020.00026","DOIUrl":null,"url":null,"abstract":"The external memories in digital design are closely related to high response time. The most common approach to mitigate latency is adding a caching mechanism into the memory subsystem. This solution might be sufficient in CPU architecture, where we can reschedule operations when a cache miss occurs. However, the FPGA architectures are usually accelerators with simple functionality, where it is not possible to postpone work. The cache miss often leads to whole pipeline stall or even to data loss. The architecture we present in this paper reduces this problem by aggregating arithmetic operations into the memory subsystem itself. Fast data processing is achieved because arithmetic operations working with external data are offloaded. Our architecture reaches a speed of 200 Mp/s (operations carried out). It is designed to be used in systems with link speeds of 100 Gb/s. It outperforms other implementations by a factor of at least 3. The additional benefit of our architecture is reducing the number of memory transactions by a factor of two on real-world datasets.","PeriodicalId":128527,"journal":{"name":"2020 23rd Euromicro Conference on Digital System Design (DSD)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Pipelined ALU for effective external memory access in FPGA\",\"authors\":\"Tomás Benes, Michal Kekely, Karel Hynek, T. Čejka\",\"doi\":\"10.1109/DSD51259.2020.00026\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The external memories in digital design are closely related to high response time. The most common approach to mitigate latency is adding a caching mechanism into the memory subsystem. This solution might be sufficient in CPU architecture, where we can reschedule operations when a cache miss occurs. However, the FPGA architectures are usually accelerators with simple functionality, where it is not possible to postpone work. The cache miss often leads to whole pipeline stall or even to data loss. The architecture we present in this paper reduces this problem by aggregating arithmetic operations into the memory subsystem itself. Fast data processing is achieved because arithmetic operations working with external data are offloaded. Our architecture reaches a speed of 200 Mp/s (operations carried out). It is designed to be used in systems with link speeds of 100 Gb/s. It outperforms other implementations by a factor of at least 3. The additional benefit of our architecture is reducing the number of memory transactions by a factor of two on real-world datasets.\",\"PeriodicalId\":128527,\"journal\":{\"name\":\"2020 23rd Euromicro Conference on Digital System Design (DSD)\",\"volume\":\"21 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 23rd Euromicro Conference on Digital System Design (DSD)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/DSD51259.2020.00026\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 23rd Euromicro Conference on Digital System Design (DSD)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSD51259.2020.00026","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

数字设计中的外存储器与高响应时间密切相关。缓解延迟的最常见方法是在内存子系统中添加缓存机制。这个解决方案在CPU体系结构中可能是足够的，我们可以在发生缓存丢失时重新安排操作。然而，FPGA架构通常是具有简单功能的加速器，因此不可能推迟工作。缓存丢失通常会导致整个管道停滞甚至数据丢失。我们在本文中提出的架构通过将算术运算聚合到内存子系统本身来减少这个问题。实现了快速数据处理，因为处理外部数据的算术运算被卸载。我们的架构达到200 Mp/s的速度(执行的操作)。它被设计用于连接速度为100gb /s的系统。它的性能至少比其他实现高出3倍。我们架构的另一个好处是在实际数据集上将内存事务的数量减少了两倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Pipelined ALU for effective external memory access in FPGA

The external memories in digital design are closely related to high response time. The most common approach to mitigate latency is adding a caching mechanism into the memory subsystem. This solution might be sufficient in CPU architecture, where we can reschedule operations when a cache miss occurs. However, the FPGA architectures are usually accelerators with simple functionality, where it is not possible to postpone work. The cache miss often leads to whole pipeline stall or even to data loss. The architecture we present in this paper reduces this problem by aggregating arithmetic operations into the memory subsystem itself. Fast data processing is achieved because arithmetic operations working with external data are offloaded. Our architecture reaches a speed of 200 Mp/s (operations carried out). It is designed to be used in systems with link speeds of 100 Gb/s. It outperforms other implementations by a factor of at least 3. The additional benefit of our architecture is reducing the number of memory transactions by a factor of two on real-world datasets.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2020 23rd Euromicro Conference on Digital System Design (DSD)

自引率

0.00%

发文量