Stream architectures - efficiency and programmability

2004 International Symposium on System-on-Chip, 2004. Proceedings. Pub Date: 2004-11-16. DOI: 10.1109/ISSOC.2004.1411141

M. Erez


Citations: 2

Abstract

Summary form only given. Stream processors are fully programmable in a high-level language, yet are capable of achieving computation efficiency comparable to fixed-function ASIC solutions (about 20 pJ/op) and can be scaled from a Gop/s (20 mW) block to a Top/s (20 W) chip in current semiconductor technology. The parallel nature of stream processors enables their performance to scale with technology. In a 2010 45 nm technology we expect an efficiency of 1 pJ/op and performance of up to 20 Top/s (20 W). A stream processor contains an array of arithmetic units that are supplied with data by a deep and explicit register hierarchy, which also serves to decouple instruction execution from unpredictable and long-latency memory operations. This decoupled and exposed-communication architecture enables a compiler to automatically map a stream application (such as a signal-flow graph) to the processing array: employing "stream scheduling" to stage the high-level movement of streams, and "communication scheduling" to schedule the data movement in the low-level kernels. This explicit optimization of communication results in almost all data and instruction movement taking place over short wires, and hence almost all energy going to useful computation. We have built a prototype streaming signal processor, Imagine, and have demonstrated streaming applications involving video compression/decompression, wireless communication, and adaptive beam-forming. We are also designing the Merrimac supercomputer, which uses a stream processor based on the same architectural principles as Imagine, illustrating the flexibility, generality, and scalability of the streaming concept. This paper describes stream architectures, stream programming systems, and streaming applications. A comparison is made to conventional DSPs, FPGAs, and ASIC solutions.
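
The power figures quoted in the abstract follow from multiplying energy per operation by operation rate. The sketch below is only an illustrative check of that arithmetic, not anything taken from the paper itself; the function name `power_watts` and the constants `GOPS` and `TOPS` are hypothetical names chosen for this example.

```python
# Illustrative check of the abstract's figures: power = energy/op * ops/s.
# All names here are assumptions made for this sketch, not from the paper.

PJ = 1e-12    # joules per picojoule
GOPS = 1e9    # operations per second in 1 Gop/s
TOPS = 1e12   # operations per second in 1 Top/s


def power_watts(energy_pj_per_op: float, ops_per_second: float) -> float:
    """Return power in watts for a given energy per op (pJ) and op rate."""
    return energy_pj_per_op * PJ * ops_per_second


# Figures stated in the abstract for then-current technology (about 20 pJ/op).
block_w = power_watts(20, 1 * GOPS)    # 1 Gop/s block
chip_w = power_watts(20, 1 * TOPS)     # 1 Top/s chip

# Projection stated in the abstract for a 2010 45 nm technology (1 pJ/op).
future_w = power_watts(1, 20 * TOPS)   # 20 Top/s chip

print(f"1 Gop/s block at 20 pJ/op: {block_w * 1e3:.1f} mW")
print(f"1 Top/s chip at 20 pJ/op:  {chip_w:.1f} W")
print(f"20 Top/s chip at 1 pJ/op:  {future_w:.1f} W")
```

Under these assumptions the three cases work out to the 20 mW, 20 W, and 20 W figures given in the abstract, which shows that the projected 20x gain in throughput at constant power comes entirely from the 20x reduction in energy per operation.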


