{"title":"Design and Implementation of Multi-Purpose DCT/DST-Specific Accelerator on Heterogeneous Multicore Architecture","authors":"Sajjad Nouri, R. G. Youvalari, J. Nurmi","doi":"10.1109/NORCHIP.2018.8573457","DOIUrl":null,"url":null,"abstract":"This paper presents the implementation of various sizes of Discrete Cosine transform (DCT) and Discrete Sine Transform (DST) dedicated for High Efficiency Video Coding (HEVC) standard by using template-based Coarse-Grained Reconfigurable Arrays (CGRAs) as accelerators on Heterogeneous Accelerator-Rich Platform (HARP). The proposal makes multipurpose DCT/DST specific accelerators in such a way that final architecture consists of 4/8/16/32–point DCT and 4-point DST. The accelerators are primarily designed by crafting template-based CGRA devices at different dimensions and then arranging them on a Network-on-Chip platform along with a few RISC cores. In this research work, the performance of each DCT/DST-specific accelerator, the collective performance of the whole platform and the NoC traffic are recorded in terms of the number of clock cycles and several high-level performance metrics. Conducted experiments show that 4-point DCT and 4-point DST can be implemented completely in 54 and 56 clock cycles, respectively, while for 8/16/32–point DCT, 67, 179 and 354 clock cycles are required, respectively. The achieved total power dissipation and energy consumption based on post placement and routing information are equal to 4.1 W and $10.87~\\mu {\\mathrm {J}}$, respectively with 256 instantiated Processing Elements (PEs) at 200.0 MHz operating frequency. It resulted to a performance of 51.2 Giga Operations Per Second (GOPS) and 12 MOPS/mW as an architectural constant for the HARP template on 28 nm Altera Stratix-V chip. The proposed architecture is able to sustain Full HD 1080p format at 30 fps on FPGA.","PeriodicalId":152077,"journal":{"name":"2018 IEEE Nordic Circuits and Systems Conference (NORCAS): NORCHIP and International Symposium of System-on-Chip (SoC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 IEEE Nordic Circuits and Systems Conference (NORCAS): NORCHIP and International Symposium of System-on-Chip (SoC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/NORCHIP.2018.8573457","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
This paper presents the implementation of various sizes of Discrete Cosine transform (DCT) and Discrete Sine Transform (DST) dedicated for High Efficiency Video Coding (HEVC) standard by using template-based Coarse-Grained Reconfigurable Arrays (CGRAs) as accelerators on Heterogeneous Accelerator-Rich Platform (HARP). The proposal makes multipurpose DCT/DST specific accelerators in such a way that final architecture consists of 4/8/16/32–point DCT and 4-point DST. The accelerators are primarily designed by crafting template-based CGRA devices at different dimensions and then arranging them on a Network-on-Chip platform along with a few RISC cores. In this research work, the performance of each DCT/DST-specific accelerator, the collective performance of the whole platform and the NoC traffic are recorded in terms of the number of clock cycles and several high-level performance metrics. Conducted experiments show that 4-point DCT and 4-point DST can be implemented completely in 54 and 56 clock cycles, respectively, while for 8/16/32–point DCT, 67, 179 and 354 clock cycles are required, respectively. The achieved total power dissipation and energy consumption based on post placement and routing information are equal to 4.1 W and $10.87~\mu {\mathrm {J}}$, respectively with 256 instantiated Processing Elements (PEs) at 200.0 MHz operating frequency. It resulted to a performance of 51.2 Giga Operations Per Second (GOPS) and 12 MOPS/mW as an architectural constant for the HARP template on 28 nm Altera Stratix-V chip. The proposed architecture is able to sustain Full HD 1080p format at 30 fps on FPGA.