FPU Generator for Design Space Exploration

Sameh Galal, Ofer Shacham, J. Brunhaver, Jing Pu, A. Vassiliev, M. Horowitz
{"title":"FPU Generator for Design Space Exploration","authors":"Sameh Galal, Ofer Shacham, J. Brunhaver, Jing Pu, A. Vassiliev, M. Horowitz","doi":"10.1109/ARITH.2013.27","DOIUrl":null,"url":null,"abstract":"FPUs have been a topic of research for almost a century, leading to thousands of papers and books. Each advance focuses on the virtues of some specific new technique. This paper compares the energy efficiency of both throughput-optimized and latency-sensitive designs, each employing an array of optimization techniques, through a fair \"apples to apples\" methodology. This comparison required us to build many optimized FP units. We accomplished this by creating a highly parameterized FPgenerator, hierarchically encompassing lower-level generators for summation trees, Booth encoders, adders, etc. Having constructed this generator we quickly relearned a number of low-level issues that are critical and are often the most neglected by papers. By exploring cascade and fused multiply-add architectures across a variety of bit widths, summation trees, booth encoders, pipelining techniques, and pipe depths, we found that for most throughput based designs, a Booth-3 fused multiply-add architecture with a Wallace combining tree is optimal. For latency designs, we found that Booth-2 cascade multiply-add architectures are better. As we describe in the paper, Wallace is not always the optimal combining network due to wire delay and track count, and the precise way the CSA's are connected in the tree can make a larger difference than the type of tree used.","PeriodicalId":211528,"journal":{"name":"2013 IEEE 21st Symposium on Computer Arithmetic","volume":"178 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-04-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"37","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 IEEE 21st Symposium on Computer Arithmetic","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ARITH.2013.27","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 37

Abstract

FPUs have been a topic of research for almost a century, leading to thousands of papers and books. Each advance focuses on the virtues of some specific new technique. This paper compares the energy efficiency of both throughput-optimized and latency-sensitive designs, each employing an array of optimization techniques, through a fair "apples to apples" methodology. This comparison required us to build many optimized FP units. We accomplished this by creating a highly parameterized FPgenerator, hierarchically encompassing lower-level generators for summation trees, Booth encoders, adders, etc. Having constructed this generator we quickly relearned a number of low-level issues that are critical and are often the most neglected by papers. By exploring cascade and fused multiply-add architectures across a variety of bit widths, summation trees, booth encoders, pipelining techniques, and pipe depths, we found that for most throughput based designs, a Booth-3 fused multiply-add architecture with a Wallace combining tree is optimal. For latency designs, we found that Booth-2 cascade multiply-add architectures are better. As we describe in the paper, Wallace is not always the optimal combining network due to wire delay and track count, and the precise way the CSA's are connected in the tree can make a larger difference than the type of tree used.
设计空间探索的FPU发生器
近一个世纪以来,fpu一直是一个研究课题,产生了数千篇论文和书籍。每个进步都集中在某些特定新技术的优点上。本文比较了吞吐量优化和延迟敏感设计的能源效率,每个设计都采用一系列优化技术,通过公平的“苹果对苹果”方法。这种比较要求我们构建许多优化的FP单元。我们通过创建一个高度参数化的FPgenerator来实现这一点,它分层地包含了用于求和树、Booth编码器、加法器等的低级生成器。在构建了这个生成器之后,我们很快重新学习了一些低级别的问题,这些问题很重要,但往往是论文中最容易忽视的。通过探索跨各种位宽度、求和树、展位编码器、流水线技术和管道深度的级联和融合乘加架构,我们发现对于大多数基于吞吐量的设计,带有Wallace组合树的booth -3融合乘加架构是最佳的。对于延迟设计,我们发现Booth-2级联乘加架构更好。正如我们在论文中所描述的那样,由于线延迟和轨道计数,Wallace并不总是最优的组合网络,并且CSA在树中连接的精确方式可能比所使用的树类型产生更大的差异。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信