Stream architectures - efficiency and programmability

M. Erez
Published in: 2004 International Symposium on System-on-Chip, 2004. Proceedings.
Publication date: 2004-11-16
DOI: https://doi.org/10.1109/ISSOC.2004.1411141
Citations: 2

Abstract

Summary form only given. Stream processors are fully programmable in a high-level language, yet are capable of achieving computation efficiency comparable to fixed-function ASIC solutions (about 20 pJ/op) and can be scaled from a Gop/s (20 mW) block to a Top/s (20 W) chip in current semiconductor technology. The parallel nature of stream processors enables their performance to scale with technology. In a 2010 45 nm technology we expect an efficiency of 1 pJ/op and performance of up to 20 Top/s (20 W). A stream processor contains an array of arithmetic units that are supplied with data by a deep and explicit register hierarchy, which also serves to decouple instruction execution from unpredictable and long-latency memory operations. This decoupled and exposed-communication architecture enables a compiler to automatically map a stream application (such as a signal-flow graph) to the processing array: employing "stream scheduling" to stage the high-level movement of streams, and "communication scheduling" to schedule the data movement in the low-level kernels. This explicit optimization of communication results in almost all data and instruction movement taking place over short wires, and hence almost all energy going to useful computation. We have built a prototype streaming signal processor, Imagine, and have demonstrated streaming applications involving video compression/decompression, wireless communication, and adaptive beam-forming. We are also designing the Merrimac supercomputer, which uses a stream processor based on the same architectural principles as Imagine, illustrating the flexibility, generality, and scalability of the streaming concept. This paper describes stream architectures, stream programming systems, and streaming applications. A comparison is made to conventional DSPs, FPGAs, and ASIC solutions.
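The power figures quoted in the abstract follow from multiplying energy per operation by throughput. The sketch below is only a check of that arithmetic, not anything taken from the paper; the function name `power_watts` and the constants `GOPS` and `TOPS` are illustrative assumptions.

```python
# Illustrative check of the abstract's efficiency figures.
# All names here are hypothetical; only the numeric inputs come from the abstract.

GOPS = 1e9   # operations per second in 1 Gop/s
TOPS = 1e12  # operations per second in 1 Top/s


def power_watts(energy_pj_per_op: float, ops_per_second: float) -> float:
    """Power (W) = energy per operation (J) x operations per second."""
    return energy_pj_per_op * 1e-12 * ops_per_second


# Current technology, about 20 pJ/op:
block_w = power_watts(20, 1 * GOPS)   # small block at 1 Gop/s
chip_w = power_watts(20, 1 * TOPS)    # full chip at 1 Top/s

# Projected 45 nm technology, 1 pJ/op:
future_w = power_watts(1, 20 * TOPS)  # chip at 20 Top/s

print(f"1 Gop/s block:  {block_w * 1e3:.0f} mW")
print(f"1 Top/s chip:   {chip_w:.0f} W")
print(f"20 Top/s chip:  {future_w:.0f} W")
```

Under these assumptions the three cases come out to roughly 20 mW, 20 W, and 20 W, which matches the abstract: the projected 20x gain in efficiency is spent on 20x more throughput within the same power budget.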