Design of a Loeffler DCT using Xilinx Vivado HLS (Abstract Only)
Authors: Seung Yeol Baik, S. Jeong, H. Oh
Published in: Proceedings of the 2015 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays, 2015-02-22
DOI: 10.1145/2684746.2694735 (https://doi.org/10.1145/2684746.2694735)
Citations: 0
Abstract
The Loeffler discrete cosine transform (DCT) algorithm is widely regarded as the most efficient because it requires the theoretical minimum number of multiplications. However, many applications still have difficulty performing the 11 multiplications that the algorithm requires to calculate a 1D eight-point DCT. To avoid expensive hardware multipliers, we designed the DCT accelerator using two methods: distributed arithmetic (DA) and shift-and-add (SAA). The memory bandwidth is 60 bits: 24 bits for reading the R (red), G (green), and B (blue) data of a pixel and 36 bits for writing the three corresponding 12-bit DCT coefficients. Thus, the 1D eight-point DCT accelerator for each of R, G, and B can have one 12-bit input port and one 12-bit output port, which lets it calculate a 2D DCT by the row-column decomposition method. Both designs are adjusted to have the same latency and interval. DA seems promising because the Loeffler DCT requires only three small tables with four input bits. However, our experiments using Xilinx Vivado HLS show that the SAA design is better than the DA design for the considered applications. Furthermore, simulation results suggest that the optimal accelerator design can be obtained by tailoring the SAA design to the considered applications. The resulting SAA design requires only 13 adders per color component and calculates one DCT coefficient per clock cycle. The precision of the internal hardware has been adjusted so that the reconstructed images have PSNR values of at least 39.1 dB for all test images (Lenna, Pepper, House, and Cameraman). If a precision of 13 bits is allowed, the PSNR is at least 44.8 dB. Our presentation describes the architecture and operation of the optimized SAA design.
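To illustrate the general idea behind a shift-and-add (SAA) multiplier, the sketch below replaces multiplication by a fixed constant with shifts and additions. It is a minimal software model, not the authors' design: the constant cos(π/4), the 8-bit fractional precision, and the function names `shift_add_terms` and `multiply_shift_add` are all assumptions chosen for illustration.

```python
import math


def shift_add_terms(constant, frac_bits):
    """Return the set bit positions of round(constant * 2**frac_bits)."""
    scaled = round(constant * (1 << frac_bits))
    return [bit for bit in range(scaled.bit_length()) if (scaled >> bit) & 1]


def multiply_shift_add(x, terms, frac_bits):
    """Multiply integer x by the quantized constant using only shifts and adds."""
    acc = 0
    for bit in terms:
        acc += x << bit        # one shifted copy of x per nonzero bit
    return acc >> frac_bits    # discard the fractional bits (floors the result)


# Hypothetical example constant and precision, not taken from the abstract.
C = math.cos(math.pi / 4)
FRAC_BITS = 8
TERMS = shift_add_terms(C, FRAC_BITS)

if __name__ == "__main__":
    x = 100
    print("shift amounts:", TERMS)
    print("shift-and-add result:", multiply_shift_add(x, TERMS, FRAC_BITS))
    print("floating-point reference:", x * C)
```

In this model each nonzero bit of the quantized constant contributes one shifted term to the sum, so the number of nonzero bits drives the adder count. A hardware design would typically reduce that count further, for example with signed-digit encodings or by sharing partial sums between constants, which this sketch does not attempt.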