Higher performance and lower power enhancements to VLIW architectures

2001 IEEE Workshop on Signal Processing Systems. SiPS 2001. Design and Implementation (Cat. No.01TH8578) Pub Date : 2001-09-26 DOI:10.1109/SIPS.2001.957342

W. Gass


Citations: 3

Abstract

Summary form only given. Enhancements to the C6000 architecture have improved performance, reduced code size, lowered power, and increased compiler efficiency. Benchmarks of DSP kernels and typical DSP applications are used to compare commercially available DSPs in terms of cycle count, power, and compiler efficiency. The C6000 VLIW family is an 8-issue architecture with four execution units for each of its two register banks. The first-generation C62x processor runs at 300 MHz and has 2 multipliers and dual 32-bit read/write ports. The second-generation C64x processor extends performance by raising the clock speed to 600 MHz, adding 2 more multipliers, and widening the load/store path to 64 bits. In addition, the C64x adds SIMD instructions that operate on packed data. The C62x is an excellent compiler target because of its deterministic order and timing of instruction execution, general-purpose 32-word register file, simple independent instructions, and lack of special modes or status bits. The C64x improves compiler efficiency by enlarging the register file to 64 words, increasing the number of common instructions that can execute on each unit, and providing non-aligned loads of packed data. The C64x reduces code size by decreasing the number of NOPs through non-aligned program memory fetches and by adding complex instructions that combine several RISC instructions into one 32-bit opcode. The C64x reduces power by adding a 2-level on-chip cache, so that most memory accesses hit the smaller first-level cache. In addition, the reduction in code size decreases the number of first-level instruction fetches, and the larger register file decreases the number of data memory accesses.

The second-generation processor has been optimized for image, graphics, and telecommunication applications. For 2D algorithms such as 30 correlation, median filtering, motion estimation, and polyphase filtering, the cycle count improvements for the kernels range from 2.3x to 7.6x. For communication algorithms such as Reed-Solomon decoding, Viterbi decoding, and FFT, the cycle count improvements for the kernels range from 2.1x to 3.5x.
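The abstract mentions SIMD operations on packed data without showing what such an operation does. The following is a minimal sketch in Python, not TI code: it emulates one packed operation, an add of two 16-bit lanes held in a single 32-bit word. The names `pack2`, `unpack2`, and `packed_add16` are hypothetical and do not correspond to any C6000 instruction or intrinsic, and wrapping (modulo 2^16) lane arithmetic is assumed for simplicity.

```python
MASK16 = 0xFFFF  # one 16-bit lane


def pack2(hi, lo):
    """Pack two 16-bit values into one 32-bit word (hi lane in the upper half)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack2(word):
    """Split a 32-bit word back into its (hi, lo) 16-bit lanes."""
    return (word >> 16) & MASK16, word & MASK16


def packed_add16(a, b):
    """Add two packed words lane by lane.

    Each lane wraps modulo 2**16 (an assumption of this sketch), and no
    carry crosses from the low lane into the high lane.
    """
    a_hi, a_lo = unpack2(a)
    b_hi, b_lo = unpack2(b)
    return pack2(a_hi + b_hi, a_lo + b_lo)


x = pack2(1000, 0xFFFF)
y = pack2(2000, 1)

# Lane-wise add: the low lane wraps to 0 and the high lane is unaffected.
print(unpack2(packed_add16(x, y)))          # (3000, 0)

# A plain 32-bit add lets the low-lane carry leak into the high lane.
print(unpack2((x + y) & 0xFFFFFFFF))        # (3001, 0)
```

The contrast between the two printed results is the point of the sketch: a packed operation treats one register as several independent narrow values, so a single instruction can process two 16-bit samples where a scalar instruction processes one.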

