W. Gass. 2001 IEEE Workshop on Signal Processing Systems. SiPS 2001. Design and Implementation (Cat. No.01TH8578). Published 2001-09-26. DOI: 10.1109/SIPS.2001.957342
Higher performance and lower power enhancements to VLIW architectures
Summary form only given. Enhancements to the C6000 architecture have improved performance, reduced code size, lowered power, and increased compiler efficiency. Benchmarks of DSP kernels and typical DSP applications are used to compare commercially available DSPs in terms of cycle count, power, and compiler efficiency. The C6000 VLIW family is an 8-issue architecture with four execution units for each of its two register banks. The first-generation processor, the C62x, runs at 300 MHz and has 2 multipliers and dual 32-bit read/write ports. The second-generation processor, the C64x, extends performance by increasing the clock speed to 600 MHz, adding 2 more multipliers, and increasing the load/store width to 64 bits. In addition, the C64x adds SIMD instructions that operate on packed data. The C62x is an excellent compiler target because of its deterministic order and timing of instruction execution, its general-purpose 32-word register file, its simple independent instructions, and its lack of special modes or status bits. The C64x improves compiler efficiency by increasing the register file to 64 words, increasing the number of common instructions that can execute on each unit, and providing non-aligned loads of packed data. The C64x reduces code size by decreasing the number of NOPs with non-aligned program memory fetches and by adding complex instructions that combine several RISC instructions into one 32-bit opcode. The C64x reduces power by adding a 2-level on-chip cache, so that most memory accesses hit the smaller first-level cache. In addition, the reduction in code size decreases the number of first-level instruction fetches, and the larger register file decreases the number of data memory accesses. The second-generation processor has been optimized for image, graphics, and telecommunication applications. For 2D algorithms such as 30 correlation, median filtering, motion estimation, and polyphase filtering, the cycle count improvements for the kernels range from 2.3x to 7.6x. For communication algorithms such as Reed-Solomon decoding, Viterbi decoding, and FFT, the cycle count improvements for the kernels range from 2.1x to 3.5x.
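The abstract mentions SIMD operations on packed data without giving details. As a rough illustration only, the sketch below emulates one such operation in Python: two 16-bit values share a 32-bit word, and a lane-wise add keeps each 16-bit lane independent. The function names (`pack2`, `unpack2`, `packed_add16`) and the choice of 16-bit lanes are assumptions made for this example; they are not taken from the paper or from the processor's instruction set.

```python
MASK16 = 0xFFFF  # one 16-bit lane


def pack2(hi, lo):
    """Pack two unsigned 16-bit values into one 32-bit word (hypothetical helper)."""
    return ((hi & MASK16) << 16) | (lo & MASK16)


def unpack2(word):
    """Split a 32-bit word back into its (hi, lo) 16-bit lanes."""
    return (word >> 16) & MASK16, word & MASK16


def packed_add16(a, b):
    """Add two packed words lane by lane.

    Each lane wraps modulo 2**16, and a carry out of the low lane is
    discarded instead of spilling into the high lane.
    """
    lo = ((a & MASK16) + (b & MASK16)) & MASK16
    hi = (((a >> 16) & MASK16) + ((b >> 16) & MASK16)) & MASK16
    return (hi << 16) | lo


a = pack2(1, 0xFFFF)
b = pack2(2, 1)

# Lane-wise add: the low lane wraps to 0 and the high lane is simply 1 + 2.
packed_result = unpack2(packed_add16(a, b))

# Ordinary 32-bit add: the carry from the low half corrupts the high half.
plain_result = unpack2((a + b) & 0xFFFFFFFF)
```

Comparing `packed_result` with `plain_result` shows why packed arithmetic needs dedicated support: an ordinary 32-bit add lets the carry from the low half leak into the high half, while a packed add treats the two halves as separate data elements processed by a single operation.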