Floating-Point Accumulation Circuit for Matrix Applications

2006 14th Annual IEEE Symposium on Field-Programmable Custom Computing Machines. Pub Date: 2006-04-24. DOI: 10.1109/FCCM.2006.41

M.R. Bodnar, J. Humphrey, P. Curt, J. Durbano, D. Prather


Citations: 20

Abstract



Many scientific algorithms require floating-point reduction operations, or accumulations, including matrix-vector multiply (MVM), vector dot products, and the discrete cosine transform (DCT). Because FPGA implementations of each of these algorithms are desirable, a high-performance floating-point accumulation unit is necessary. However, this type of circuit is difficult to design in an FPGA environment because the floating-point arithmetic units must be deeply pipelined to attain high-performance designs (Durbano et al., 2004; Leeser and Wang, 2004). A deep pipeline requires special handling in feedback circuits because of the long delay, which is further complicated by a continuous input data stream. Accumulator architectures that overcome such performance bottlenecks are described in Zhuo et al. (2005) and Zhuo and Prasanna (2005). This paper presents a floating-point accumulation circuit that is a natural evolution of this work. The system can handle streams of arbitrary length, requires modest area, and can handle interrupted data inputs. In contrast to the designs proposed by Zhuo et al., the proposed architecture maintains buffers for partial result storage that use significantly fewer embedded memory resources, while maintaining fixed size and speed characteristics regardless of stream length. The results for both single- and double-precision accumulation architectures were verified in a Virtex-II 8000-4 part clocked at more than 150 MHz, and the power of this design was demonstrated in a computationally intense matrix-matrix-multiply application.
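The feedback problem the abstract describes can be illustrated with a small software model. The sketch below is not the authors' circuit or the buffering scheme of the cited designs. It only assumes a hypothetical adder whose result becomes available `latency` cycles after issue, so a continuous input stream has to be spread over `latency` independent partial sums that are combined once the stream ends. The function name and the default latency are illustrative choices, not values from the paper.

```python
from typing import Iterable, List


def pipelined_accumulate(stream: Iterable[float], latency: int = 4) -> float:
    """Sum a stream using `latency` interleaved partial sums.

    Software model only: with an adder that takes `latency` cycles, the
    input arriving at cycle i can be added only to the result issued at
    cycle i - latency, so one running sum is kept per pipeline slot.
    """
    if latency < 1:
        raise ValueError("latency must be at least 1")

    # One partial sum per pipeline slot (hypothetical storage buffer).
    partial: List[float] = [0.0] * latency
    for cycle, value in enumerate(stream):
        partial[cycle % latency] += value

    # After the stream ends, reduce the partial sums pairwise.
    while len(partial) > 1:
        reduced = [partial[i] + partial[i + 1]
                   for i in range(0, len(partial) - 1, 2)]
        if len(partial) % 2:
            reduced.append(partial[-1])  # odd element carries over
        partial = reduced
    return partial[0]


if __name__ == "__main__":
    # Dot product of two short vectors, fed as a stream of products.
    a = [1.0, 2.0, 3.0, 4.0, 5.0]
    b = [2.0, 2.0, 2.0, 2.0, 2.0]
    print(pipelined_accumulate((x * y for x, y in zip(a, b)), latency=3))
```

Because the grouping of additions differs from a simple left-to-right sum, the floating-point result of such a scheme can differ from a sequential sum in the last bits for general inputs. The model also ignores interrupted inputs and back-to-back streams, which the paper's architecture is stated to handle.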
