DCT implementation and performance evaluation on a scalable matrix processor

The 2010 International Conference on Computer Engineering & Systems. Pub Date: 2010-12-23. DOI: 10.1109/ICCES.2010.5674882

M. Soliman, A. F. Al-Junaid



DCT implementation and performance evaluation on a scalable matrix processor

The discrete cosine transform (DCT) is one of the major operations in various image/video compression standards. This paper implements the DCT and its inverse (IDCT) on our proposed Mat-Core processor using scalar/vector/matrix instruction sets. Mat-Core extends a general-purpose scalar processor with a matrix unit for processing vector/matrix data. To hide memory latency, the matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. The data computation unit is organized in parallel lanes, which can execute scalar-vector, vector-vector, scalar-matrix, vector-matrix, and matrix-matrix instructions. To show the scalability of the Mat-Core architecture, the performance of DCT and IDCT is evaluated on Mat-Core with different numbers of parallel lanes (one, four, and eight). A cycle-accurate model of the Mat-Core processor is implemented in SystemC, a system-level modeling language. Our results show performance of 1.5, 5, 6.4, and 14.4 FLOPs/cycle on Mat-Core with a single lane and 8-element vector registers, four lanes and 4×4 matrix registers, four lanes and 8×4 matrix registers, and eight lanes and 8×8 matrix registers, respectively. The maximum performance of the Mat-Core processor on DCT and IDCT is 90% of the ideal value. Moreover, increasing the number of parallel lanes from one to four and then to eight speeds up the execution of DCT and IDCT by factors of 4.2 and 9.5, respectively, which indicates the scalability of the Mat-Core architecture.
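The abstract does not list the Mat-Core instruction sequence, so the following is only a software reference sketch of why the DCT maps naturally onto matrix-matrix instructions: for an 8×8 block X, the 2-D DCT can be written as Y = C X Cᵀ and the IDCT as X = Cᵀ Y C, where C is the orthonormal DCT-II basis matrix. The function names (`dct_matrix`, `matmul`, `dct2`, `idct2`) and the choice of the orthonormal scaling are assumptions made for illustration, not taken from the paper.

```python
import math

N = 8  # block size assumed here; 8x8 blocks are common in image/video codecs


def dct_matrix(n=N):
    """Return the n x n orthonormal DCT-II basis matrix C (illustrative helper)."""
    rows = []
    for k in range(n):
        # Row 0 uses a different scale so that C is orthonormal (C * C^T = I).
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows


def matmul(a, b):
    """Plain matrix-matrix product; stands in for a hardware matrix instruction."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct2(block):
    """2-D DCT of a square block as two matrix products: Y = C X C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """2-D inverse DCT as two matrix products: X = C^T Y C."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


if __name__ == "__main__":
    # Arbitrary test block; the values carry no meaning beyond the demo.
    block = [[float((3 * r + 5 * col) % 17) for col in range(N)] for r in range(N)]
    restored = idct2(dct2(block))
    err = max(abs(restored[r][col] - block[r][col])
              for r in range(N) for col in range(N))
    print("max round-trip error:", err)
```

Each transform here is two dense 8×8 matrix products, which is the kind of work the abstract's matrix-matrix instructions and parallel lanes are described as executing; how the authors actually schedule it on Mat-Core is not stated in this text.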
