改进内存访问，在异构系统上实现性能和能源效率

2014 International Conference on Field-Programmable Technology (FPT) Pub Date : 2014-12-01 DOI:10.1109/FPT.2014.7082759

Hongyuan Ding, Miaoqing Huang

{"title":"改进内存访问，在异构系统上实现性能和能源效率","authors":"Hongyuan Ding, Miaoqing Huang","doi":"10.1109/FPT.2014.7082759","DOIUrl":null,"url":null,"abstract":"Hardware accelerators are capable of achieving significant performance improvement for many applications. In this work we demonstrate that it is critical to provide sufficient memory access bandwidth for accelerators to improve the performance and reduce energy consumption. We use the scale-invariant feature transform (SIFT) algorithm as a case study in which three bottleneck stages are accelerated on hardware logic. Based on different memory access patterns of SIFT algorithms, two different approaches are designed to accelerate different functions in SIFT on the Xilinx Zynq-7045 device. In the first approach, convolution is accelerated by designing fully customized hardware accelerator. On top of it, three interfacing methods are analyzed. In the second approach, a distributed multi-processor hardware system with its programming model is built to handle inconsecutive memory accesses. Furthermore, the last level cache (LLC) on the host processor is shared by all slaves to achieve better performance. Experiment results on the Zynq-7045 device show that the hybrid design in which two approaches are combined can achieve ~10 times and better improvement for both performance improvement and energy reduction compared with the pure software implementation for the convolution stage and the SIFT algorithm, respectively.","PeriodicalId":6877,"journal":{"name":"2014 International Conference on Field-Programmable Technology (FPT)","volume":"30 1","pages":"91-98"},"PeriodicalIF":0.0000,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Improve memory access for achieving both performance and energy efficiencies on heterogeneous systems\",\"authors\":\"Hongyuan Ding, Miaoqing Huang\",\"doi\":\"10.1109/FPT.2014.7082759\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Hardware accelerators are capable of achieving significant performance improvement for many applications. In this work we demonstrate that it is critical to provide sufficient memory access bandwidth for accelerators to improve the performance and reduce energy consumption. We use the scale-invariant feature transform (SIFT) algorithm as a case study in which three bottleneck stages are accelerated on hardware logic. Based on different memory access patterns of SIFT algorithms, two different approaches are designed to accelerate different functions in SIFT on the Xilinx Zynq-7045 device. In the first approach, convolution is accelerated by designing fully customized hardware accelerator. On top of it, three interfacing methods are analyzed. In the second approach, a distributed multi-processor hardware system with its programming model is built to handle inconsecutive memory accesses. Furthermore, the last level cache (LLC) on the host processor is shared by all slaves to achieve better performance. Experiment results on the Zynq-7045 device show that the hybrid design in which two approaches are combined can achieve ~10 times and better improvement for both performance improvement and energy reduction compared with the pure software implementation for the convolution stage and the SIFT algorithm, respectively.\",\"PeriodicalId\":6877,\"journal\":{\"name\":\"2014 International Conference on Field-Programmable Technology (FPT)\",\"volume\":\"30 1\",\"pages\":\"91-98\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-12-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2014 International Conference on Field-Programmable Technology (FPT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/FPT.2014.7082759\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2014 International Conference on Field-Programmable Technology (FPT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/FPT.2014.7082759","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

摘要

硬件加速器能够为许多应用程序实现显著的性能改进。在这项工作中，我们证明了为加速器提供足够的内存访问带宽对于提高性能和降低能耗至关重要。我们使用尺度不变特征变换(SIFT)算法作为案例研究，其中三个瓶颈阶段在硬件逻辑上加速。基于SIFT算法的不同内存访问模式，设计了两种不同的方法来加速Xilinx Zynq-7045器件上SIFT中的不同功能。在第一种方法中，通过设计完全定制的硬件加速器来加速卷积。在此基础上，分析了三种接口方法。在第二种方法中，建立了一个分布式多处理器硬件系统及其编程模型来处理非连续内存访问。此外，主机处理器上的最后一级缓存(LLC)由所有从服务器共享，以获得更好的性能。在Zynq-7045设备上的实验结果表明，两种方法相结合的混合设计在卷积阶段和SIFT算法的性能提升和能耗降低方面分别比纯软件实现提高了10倍以上。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Improve memory access for achieving both performance and energy efficiencies on heterogeneous systems

Hardware accelerators are capable of achieving significant performance improvement for many applications. In this work we demonstrate that it is critical to provide sufficient memory access bandwidth for accelerators to improve the performance and reduce energy consumption. We use the scale-invariant feature transform (SIFT) algorithm as a case study in which three bottleneck stages are accelerated on hardware logic. Based on different memory access patterns of SIFT algorithms, two different approaches are designed to accelerate different functions in SIFT on the Xilinx Zynq-7045 device. In the first approach, convolution is accelerated by designing fully customized hardware accelerator. On top of it, three interfacing methods are analyzed. In the second approach, a distributed multi-processor hardware system with its programming model is built to handle inconsecutive memory accesses. Furthermore, the last level cache (LLC) on the host processor is shared by all slaves to achieve better performance. Experiment results on the Zynq-7045 device show that the hybrid design in which two approaches are combined can achieve ~10 times and better improvement for both performance improvement and energy reduction compared with the pure software implementation for the convolution stage and the SIFT algorithm, respectively.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2014 International Conference on Field-Programmable Technology (FPT)

自引率

0.00%

发文量