Design and Implementation of Multi-Purpose DCT/DST-Specific Accelerator on Heterogeneous Multicore Architecture

2018 IEEE Nordic Circuits and Systems Conference (NORCAS): NORCHIP and International Symposium of System-on-Chip (SoC). Pub Date: 2018-10-01. DOI: 10.1109/NORCHIP.2018.8573457

Sajjad Nouri, R. G. Youvalari, J. Nurmi


Citations: 2

Abstract


This paper presents the implementation of several sizes of the Discrete Cosine Transform (DCT) and the Discrete Sine Transform (DST) for the High Efficiency Video Coding (HEVC) standard, using template-based Coarse-Grained Reconfigurable Arrays (CGRAs) as accelerators on the Heterogeneous Accelerator-Rich Platform (HARP). The proposal builds multi-purpose DCT/DST-specific accelerators such that the final architecture covers 4/8/16/32-point DCT and 4-point DST. The accelerators are primarily designed by crafting template-based CGRA devices of different dimensions and then arranging them on a Network-on-Chip (NoC) platform along with a few RISC cores. The performance of each DCT/DST-specific accelerator, the collective performance of the whole platform, and the NoC traffic are recorded in terms of the number of clock cycles and several high-level performance metrics. The experiments show that the 4-point DCT and the 4-point DST complete in 54 and 56 clock cycles, respectively, while the 8/16/32-point DCTs require 67, 179, and 354 clock cycles, respectively. Based on post-place-and-route information, the total power dissipation and energy consumption are 4.1 W and 10.87 µJ, respectively, with 256 instantiated Processing Elements (PEs) at an operating frequency of 200.0 MHz. This results in a performance of 51.2 Giga Operations Per Second (GOPS) and 12 MOPS/mW as an architectural constant for the HARP template on a 28 nm Altera Stratix-V chip. The proposed architecture is able to sustain Full HD 1080p format at 30 fps on the FPGA.
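
The headline figures in the abstract can be cross-checked with a few lines of arithmetic. The sketch below is illustrative only and is not taken from the paper: it assumes each PE completes one operation per clock cycle (the abstract does not state this, but the assumption reproduces the quoted 51.2 GOPS exactly), and the function names are hypothetical. The abstract also does not say which unit of work the cycle counts refer to, so the latencies are simply the cycle counts converted to time at 200 MHz.

```python
# Figures quoted in the abstract.
NUM_PES = 256            # instantiated Processing Elements
FREQ_HZ = 200.0e6        # operating frequency, 200 MHz
POWER_W = 4.1            # total power dissipation
CYCLES = {"DCT-4": 54, "DST-4": 56, "DCT-8": 67, "DCT-16": 179, "DCT-32": 354}

# Assumption (not stated in the abstract): one operation per PE per cycle.
OPS_PER_PE_PER_CYCLE = 1


def peak_gops(num_pes, freq_hz, ops_per_pe_per_cycle=OPS_PER_PE_PER_CYCLE):
    """Peak throughput in giga-operations per second."""
    return num_pes * ops_per_pe_per_cycle * freq_hz / 1e9


def mops_per_mw(gops, power_w):
    """Power efficiency: (GOPS -> MOPS) divided by (W -> mW)."""
    return (gops * 1e3) / (power_w * 1e3)


def latency_us(cycles, freq_hz):
    """Convert a clock-cycle count to microseconds at the given frequency."""
    return cycles / freq_hz * 1e6


if __name__ == "__main__":
    gops = peak_gops(NUM_PES, FREQ_HZ)
    print(f"peak throughput: {gops:.1f} GOPS")
    print(f"efficiency: {mops_per_mw(gops, POWER_W):.2f} MOPS/mW")
    for name, cycles in CYCLES.items():
        print(f"{name}: {cycles} cycles = {latency_us(cycles, FREQ_HZ):.3f} us")
```

Under this assumption the efficiency works out to roughly 12.5 MOPS/mW, which is in line with the 12 MOPS/mW quoted in the abstract.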

