The design and applications of BEE2: A high end reconfigurable computing system

2005 IEEE Hot Chips XVII Symposium (HCS). Pub Date: 2005-08-01. DOI: 10.1109/HOTCHIPS.2005.7476601

Chen Chang, J. Wawrzynek, B. Brodersen


Citations: 15



The design and applications of BEE2: A high end reconfigurable computing system

This paper summarizes our effort to design and construct a high-end reconfigurable computer (HERC) system that uses field-programmable gate arrays (FPGAs) as its only processing elements. FPGAs offer many important potential advantages over conventional microprocessors and digital signal processors (DSPs), such as flexible arithmetic precision, higher computational density per unit silicon area, and lower power consumption. The programmable interconnect structure unique to FPGA technology makes it possible to tailor a HERC system, such as our BEE2 system, on a per-problem basis to take best advantage of task-specific dataflow, memory access patterns, and node-to-node communication patterns.

Our BEE2 project is a coordinated attack on the elements needed to demonstrate a practical, cost-effective, high-end reconfigurable computer: the design of a processing module to be used as the building block for a family of high-end reconfigurable computers; the development of several programming models; and the demonstration of the machine's efficiency on a set of demanding applications, ranging from high-performance digital signal processing and communication systems to traditional scientific computing. On selected DSP applications, BEE2 can provide over 100 times more computing throughput than a microprocessor-based system with similar power consumption and cost.

Several computationally intensive problems central to the research objectives of BWRC serve as our application benchmark set and as design drivers for the specification of the BEE2 machine architecture and its associated software mapping tools. These applications fall into four broad categories: high-performance real-time digital signal processing, emulation and design of novel wireless communications systems, real-time scientific computation and simulation, and acceleration of computer-aided design (CAD) tools.

Because the BEE2 system targets diverse application domains, no single programming model would be optimal for all applications; hence the need for domain-specific programming models that can fully exploit the computing power of the BEE2 system. Currently, the most mature programming model for the BEE2 system is the synchronous data flow model for DSP and communication applications. Commercial tools, including MathWorks MATLAB/Simulink and Xilinx System Generator, along with automation tools developed at BWRC, provide automatic mapping from high-level block diagrams and state machine specifications to FPGA configurations. This programming model and tool flow have proven very successful on a variety of projects at BWRC, particularly in the areas of DSP and other datapath-intensive streaming applications. To extend this model to support BEE2-specific hardware, stream-based design abstractions are currently being developed for external DRAMs and global communication networks.

We have completed the design and fabrication of a compute module comprising five Xilinx XC2VP70 FPGAs, 20 DRAM DIMMs, and 18 off-module 10 Gbit/s InfiniBand/Ethernet connections, shown in Figure 1. This module has peak performance in the 1-2 TeraOp/s range (integer operations) and forms the basic building block for larger systems, scalable from one to hundreds of modules.

To date, our most extensive application development has been in collaboration with the SETI@HOME, SERENDIP project at UC Berkeley Space Science Laboratory and the UC Berkeley Radio Astronomy Laboratory. We have successfully demonstrated an 800 MHz, billion-channel spectrometer using the BEE2 system on a single antenna. We have analyzed the performance of our FPGA-based approach on this and other radio astronomy applications.

In terms of computational throughput per chip, the FPGAs in the BEE2 system outperform a 720 MHz DSP by a factor of 10 to 34, a 1 GHz (90 nm) DSP by a factor of 7 to 25, and the latest Pentium-4 by a factor of 4 to 13. In terms of power efficiency, the XC2VP70 FPGA delivers 72% to 106% more throughput than the DSPs on 16-bit operations, and more than 11 times the throughput on 4-bit operations. Compared to the microprocessor, the FPGA is over 100 times more power efficient. Similarly, the compute throughput per unit chip cost of the FPGAs is 20% to 307% higher than that of the 1 GHz DSP, and 50% to 505% higher than that of the 3.8 GHz Pentium-4 processor.

We are currently developing more advanced radio astronomy applications. By the end of summer 2005 (in time for the symposium), we expect to have an 8-antenna correlator system prototype running on the Green Bank Telescope dipole antenna array. We plan to develop a similar correlator for the Allen Telescope Array (ATA) with 32 antennas in the second half of 2005. For the final 350-antenna version of the ATA, 121 BEE2 modules will be employed, providing an aggregate computational throughput of over 200 TeraOp/s.

Figure 1: Compute Module Architecture Diagram and Photo
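
The scaling figures quoted in the abstract can be sanity-checked with a few lines of arithmetic. The sketch below is illustrative only and rests on two assumptions that the abstract does not state: that peak throughput scales linearly with the number of modules, and that a correlator cross-correlates every pair of antennas. The function names are hypothetical and are not part of any BEE2 tool.

```python
# Figures quoted in the abstract.
MODULE_PEAK_TERAOPS = (1.0, 2.0)  # per-module peak range, integer operations
ATA_MODULES = 121                 # modules planned for the final ATA system
ANTENNA_COUNTS = (8, 32, 350)     # Green Bank prototype, interim ATA, final ATA


def aggregate_teraops(modules, per_module_range):
    """Peak aggregate throughput, assuming linear scaling across modules."""
    low, high = per_module_range
    return modules * low, modules * high


def baselines(antennas):
    """Number of antenna pairs, assuming every pair is cross-correlated."""
    return antennas * (antennas - 1) // 2


if __name__ == "__main__":
    low, high = aggregate_teraops(ATA_MODULES, MODULE_PEAK_TERAOPS)
    print(f"{ATA_MODULES} modules: {low:.0f} to {high:.0f} TeraOp/s peak")
    for n in ANTENNA_COUNTS:
        print(f"{n} antennas: {baselines(n)} baselines")
```

Under these assumptions, 121 modules give a peak range of 121 to 242 TeraOp/s, which brackets the quoted aggregate of over 200 TeraOp/s. The pair count grows quadratically, from 28 pairs for 8 antennas to 61075 pairs for 350, which suggests why the full array calls for a machine of this scale.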

