Design of a Loeffler DCT using Xilinx Vivado HLS (Abstract Only)

Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays. Pub Date: 2015-02-22. DOI: 10.1145/2684746.2694735

Seung Yeol Baik, S. Jeong, H. Oh


Abstract

The Loeffler discrete cosine transform (DCT) algorithm is regarded as the most efficient DCT algorithm because it requires the theoretical minimum number of multiplications. However, many applications still have difficulty performing the 11 multiplications that the algorithm requires to calculate a 1D eight-point DCT. To avoid expensive multipliers in hardware, we designed the DCT accelerator using two methods: distributed arithmetic (DA) and shift-and-add (SAA). The memory bandwidth is 60 bits: 24 bits for reading the R (red), G (green), and B (blue) data of a pixel and 36 bits for writing the three corresponding 12-bit DCT coefficients. Thus, the 1D eight-point DCT accelerator for each of R, G, and B can have one 12-bit input port and one 12-bit output port, so that it can calculate a 2D DCT by the row-column decomposition method. The two designs are adjusted to produce the same latency and interval. DA seems promising because the Loeffler DCT requires only three small tables with four input bits. However, our experiments using Xilinx Vivado HLS show that the SAA design is better than the DA design for the considered applications. Furthermore, simulation results suggest that the optimal accelerator design can be obtained by adjusting the SAA design to the considered applications. The resulting SAA design requires only 13 adders per color component and can calculate one DCT coefficient per clock cycle. The precision of the internal hardware has been adjusted so that the reconstructed images have peak signal-to-noise ratio (PSNR) values of at least 39.1 dB for all test images (Lenna, Pepper, House, and Cameraman). If a precision of 13 bits is allowed, the PSNR is at least 44.8 dB. Our presentation describes the architecture and operation of the optimized SAA design.
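To illustrate the general shift-and-add idea mentioned in the abstract, the following is a minimal C++ sketch of multiplying a sample by a fixed constant without a hardware multiplier. It is not the authors' design: the constant cos(π/4), the 8 fractional bits, the rounding step, and the helper names `shl`, `times181`, and `mul_cos_pi_4` are all assumptions chosen for illustration. Only the 12-bit sample width is taken from the abstract.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: left shift performed on the unsigned representation,
// so that negative inputs are well defined in every C++ standard version.
static inline int32_t shl(int32_t v, unsigned n) {
    return static_cast<int32_t>(static_cast<uint32_t>(v) << n);
}

// x * 181 built from five shifted copies of x and four additions,
// because 181 = 128 + 32 + 16 + 4 + 1.
static inline int32_t times181(int32_t x) {
    return shl(x, 7) + shl(x, 5) + shl(x, 4) + shl(x, 2) + x;
}

// Approximates x * cos(pi/4) in fixed point with 8 fractional bits
// (181 / 256 ~= 0.70703, an assumed precision for this sketch).
static inline int32_t mul_cos_pi_4(int32_t x) {
    assert(x >= -2048 && x <= 2047);  // 12-bit signed sample assumed
    // Add half an LSB for rounding, then drop the 8 fractional bits.
    // Assumes >> on a negative value is an arithmetic shift, which holds
    // on mainstream compilers and is guaranteed from C++20 onward.
    return (times181(x) + 128) >> 8;
}
```

In this sketch the constant costs four adders. A hardware design would typically trade the number of fractional bits against adder count and output precision, which is the kind of adjustment the abstract describes for the SAA design.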
