DCT implementation and performance evaluation on a scalable matrix processor
M. Soliman, A. F. Al-Junaid
The 2010 International Conference on Computer Engineering & Systems, 2010-12-23
https://doi.org/10.1109/ICCES.2010.5674882
The discrete cosine transform (DCT) is one of the major operations in various image/video compression standards. This paper implements the DCT and its inverse (IDCT) on our proposed Mat-Core processor using scalar/vector/matrix instruction sets. Mat-Core extends a general-purpose scalar processor with a matrix unit for processing vector/matrix data. To hide memory latency, the matrix unit is decoupled into two components, address generation and data computation, which communicate through data queues. The data computation unit is organized in parallel lanes, which can execute scalar-vector, vector-vector, scalar-matrix, vector-matrix, and matrix-matrix instructions. To show the scalability of the Mat-Core architecture, the performance of the DCT and IDCT is evaluated on Mat-Core with different numbers of parallel lanes (one, four, and eight). A cycle-accurate model of the Mat-Core processor is implemented in SystemC, a system-level modeling language. Our results show performance of 1.5, 5, 6.4, and 14.4 FLOPs/cycle on Mat-Core with a single lane and 8-element vector registers, four lanes and 4×4 matrix registers, four lanes and 8×4 matrix registers, and eight lanes and 8×8 matrix registers, respectively. The maximum performance of the Mat-Core processor on the DCT and IDCT represents 90% of the ideal value. Moreover, increasing the number of parallel lanes from one to four and then to eight speeds up the execution of the DCT and IDCT by factors of 4.2 and 9.5, respectively, which indicates the scalability of the Mat-Core architecture.
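The abstract does not say how the DCT is mapped onto matrix instructions. One common formulation that suits matrix-matrix hardware is the separable form Y = C X Cᵀ, with the inverse X = Cᵀ Y C, where C is the orthonormal DCT-II basis matrix. The sketch below illustrates that form in plain Python; it is an assumption for illustration and not the authors' Mat-Core kernel, and the helper names (`dct_matrix`, `matmul`, `dct2`, `idct2`) are hypothetical.

```python
import math


def dct_matrix(n=8):
    """Build the n x n orthonormal DCT-II basis matrix C as nested lists."""
    c = []
    for k in range(n):
        # Row 0 uses a different scale so that C is orthonormal (C @ C^T = I).
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        c.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                  for j in range(n)])
    return c


def matmul(a, b):
    """Plain matrix-matrix product; stands in for a matrix-matrix instruction."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[k] * b[k][j] for k in range(inner)) for j in range(cols)]
            for row in a]


def transpose(m):
    """Return the transpose of a nested-list matrix."""
    return [list(col) for col in zip(*m)]


def dct2(block):
    """2D DCT of a square block as two matrix products: C * X * C^T."""
    c = dct_matrix(len(block))
    return matmul(matmul(c, block), transpose(c))


def idct2(coeffs):
    """2D inverse DCT of a square block: C^T * Y * C (valid because C is orthonormal)."""
    c = dct_matrix(len(coeffs))
    return matmul(matmul(transpose(c), coeffs), c)


if __name__ == "__main__":
    # Hypothetical 8x8 test block; a real codec would use pixel data.
    block = [[float((3 * i + 5 * j) % 11) for j in range(8)] for i in range(8)]
    restored = idct2(dct2(block))
    max_err = max(abs(restored[i][j] - block[i][j])
                  for i in range(8) for j in range(8))
    print("round-trip max error below 1e-9:", max_err < 1e-9)
```

Written this way, both the forward and inverse transforms reduce to two dense matrix products per block, which is the kind of work that matrix registers and parallel lanes are meant to accelerate. Fast factorized DCT algorithms use fewer arithmetic operations but have a less regular structure.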