2011 IEEE 20th Symposium on Computer Arithmetic最新文献

筛选
英文 中文
A Prescale-Lookup-Postscale Additive Procedure for Obtaining a Single Precision Ulp Accurate Reciprocal 获得单精度Ulp精确倒数的尺度前-查找-尺度后加性方法
2011 IEEE 20th Symposium on Computer Arithmetic Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.31
D. Matula, Mihai T. Panu
{"title":"A Prescale-Lookup-Postscale Additive Procedure for Obtaining a Single Precision Ulp Accurate Reciprocal","authors":"D. Matula, Mihai T. Panu","doi":"10.1109/ARITH.2011.31","DOIUrl":"https://doi.org/10.1109/ARITH.2011.31","url":null,"abstract":"We investigate the utilization of additive prescaling for reducing the input range for determining a divisor reciprocal approximation. In particular, we demonstrate that a single precision (24-bit) ulp accurate monotonic approximate reciprocal operation can be obtained employing three back-to-back three term additions and bipartite table lookup with total table size less than 6 Kbytes.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132865746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Latency Sensitive FMA Design 延迟敏感FMA设计
2011 IEEE 20th Symposium on Computer Arithmetic Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.26
Sameh Galal, M. Horowitz
{"title":"Latency Sensitive FMA Design","authors":"Sameh Galal, M. Horowitz","doi":"10.1109/ARITH.2011.26","DOIUrl":"https://doi.org/10.1109/ARITH.2011.26","url":null,"abstract":"The implementation of merged floating-point multiply-add operations can be optimized in many ways. For latency sensitive applications, our cascade design reduces the accumulation dependent latency by 2x over a fused design, at a cost of a 13% increase in non-accumulation dependent latency. A simple in-order execution model shows this design is superior in most applications, providing 12% average reduction in FP stalls, and improves performance by up to 6%. Simulations of superscalar out-of-order machines show 4% average improvement in CPI in 2-way machines and 4.6% in 4-way machines. The cascade design has the same area and energy budget as a traditional fused multiple-add FMA.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125719950","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
Self Checking in Current Floating-Point Units 当前浮点单位的自检
2011 IEEE 20th Symposium on Computer Arithmetic Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.18
Daniel Lipetz, E. Schwarz
{"title":"Self Checking in Current Floating-Point Units","authors":"Daniel Lipetz, E. Schwarz","doi":"10.1109/ARITH.2011.18","DOIUrl":"https://doi.org/10.1109/ARITH.2011.18","url":null,"abstract":"High performance microprocessors are protected against transient and early end of life failures using a variety of error detection and fault isolation technologies. Execution units can be protected with duplication, parity prediction, or residue checking. Residue checking has an advantage due to its small size. A modulus is selected based on the radix of the numbers being checked. In a decimal floating-point unit there are two types of numbers in different bases. There are base 10 decimal numbers and base 2 integers being used. A residue checking system that makes it easy to check both base 2 and 10 numbers is discussed. Current state of the art designs that are currently in use are described as well as a novel hybrid moduli 9 and 3 residue system. The checking systems for the decimal and binary floating-point units of some recent IBM microprocessors including the Power6, Power7, z10, and z196 microprocessors are detailed.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122774776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 26
The IBM zEnterprise-196 Decimal Floating-Point Accelerator IBM zEnterprise-196十进制浮点加速器
2011 IEEE 20th Symposium on Computer Arithmetic Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.27
S. Carlough, Adam Collura, S. M. Müller, M. Kroener
{"title":"The IBM zEnterprise-196 Decimal Floating-Point Accelerator","authors":"S. Carlough, Adam Collura, S. M. Müller, M. Kroener","doi":"10.1109/ARITH.2011.27","DOIUrl":"https://doi.org/10.1109/ARITH.2011.27","url":null,"abstract":"Decimal floating-point Arithmetic is widely used in commercial computing applications, such as financial transactions, where rounding errors prevent the use of binary floating-point operations. The revised IEEE Standard for Floating-Point Arithmetic (IEEE-754-2008) defined standardized decimal floating-point (DFP) formats. As more software applications adopt the IEEE decimal floating-point standard, hardware accelerators that support it are becoming more prevalent. This paper describes the second generation decimal floating-point accelerator implemented on the IBM zEnterprise-196 processor. The 4-cycle deep pipeline was designed to optimize the latency of fixed-point decimal operations while significantly improving the bandwidth of DFP operations. A detailed description of the unit and a comparison to previous implementations found in literature is provided.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123775259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 47
On the Fixed-Point Accuracy Analysis and Optimization of FFT Units with CORDIC Multipliers 带CORDIC乘法器的FFT单元不动点精度分析与优化
2011 IEEE 20th Symposium on Computer Arithmetic Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.17
O. Sarbishei, K. Radecka
{"title":"On the Fixed-Point Accuracy Analysis and Optimization of FFT Units with CORDIC Multipliers","authors":"O. Sarbishei, K. Radecka","doi":"10.1109/ARITH.2011.17","DOIUrl":"https://doi.org/10.1109/ARITH.2011.17","url":null,"abstract":"Fixed-point Fast Fourier Transform (FFT) units are widely used in digital communication systems. The twiddle multipliers required for realizing large FFTs are typically implemented with the Coordinate Rotation Digital Computer (CORDIC) algorithm to restrict memory requirements. Recent approaches aiming to optimize the bit-widths of FFT units while satisfying a given maximum bound on Mean-Square-Error (MSE) mostly focus on the architectures with integer multipliers. They ignore the quantization error of coefficients, disabling them to analyze the exact error defined as the difference between the fixed-point circuit and the reference floating-point model. This paper presents an efficient analysis of MSE as well as an optimization algorithm for CORDIC-based FFT units, which is applicable to other Linear-Time-Invariant (LTI) circuits as well.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126482062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 12
Accelerating Computations on FPGA Carry Chains by Operand Compaction 通过操作数压缩加速FPGA进位链的计算
2011 IEEE 20th Symposium on Computer Arithmetic Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.22
Thomas B. Preußer, M. Zabel, R. Spallek
{"title":"Accelerating Computations on FPGA Carry Chains by Operand Compaction","authors":"Thomas B. Preußer, M. Zabel, R. Spallek","doi":"10.1109/ARITH.2011.22","DOIUrl":"https://doi.org/10.1109/ARITH.2011.22","url":null,"abstract":"This work describes the carry-compact addition (CCA), a novel addition scheme that allows the acceleration of carry-chain computations on contemporary FPGA devices. While based on concepts known from the carry-look ahead addition and from parallel prefix adders, their adaptation by the CCA takes the context of an FPGA as implementation environment into account. These typically provide carry-chain structures to accelerate the simple ripple-carry addition (RCA). Rather than contrasting this scheme with the hierarchical addition approaches favored in hard-core VLSI designs, the CCA combines the benefits of both and uses hierarchical structures to shorten the critical path, which is still left on a core carry chain. In contrast to previous studies examining the asymptotically superior parallel prefix adders on FPGAs, the CCA is shown to outperform the standard RCA already for operand widths starting at 50~bits. Wider adders such as used in extended-precision floating-point units and in cryptographic applications even benefit from increasing speedups. The concrete mapping of the CCA as achieved for current Xilinx and Altera architectures is described and shown to be very favorable so as to yield a high speedup for a very modest investment of additional LUT resources.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127989783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 17
Efficient SIMD Arithmetic Modulo a Mersenne Number 高效SIMD算术模梅森数
2011 IEEE 20th Symposium on Computer Arithmetic Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.37
Joppe W. Bos, T. Kleinjung, A. Lenstra, P. L. Montgomery
{"title":"Efficient SIMD Arithmetic Modulo a Mersenne Number","authors":"Joppe W. Bos, T. Kleinjung, A. Lenstra, P. L. Montgomery","doi":"10.1109/ARITH.2011.37","DOIUrl":"https://doi.org/10.1109/ARITH.2011.37","url":null,"abstract":"This paper describes carry-less arithmetic operations modulo an integer 2^M-1 in the thousand-bit range, targeted at single instruction multiple data platforms and applications where overall throughput is the main performance criterion. Using an implementation on a cluster of PlayStation 3 game consoles a new record was set for the elliptic curve method for integer factorization.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130783691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 23
Teraflop FPGA Design Teraflop FPGA设计
2011 IEEE 20th Symposium on Computer Arithmetic Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.32
M. Langhammer
{"title":"Teraflop FPGA Design","authors":"M. Langhammer","doi":"10.1109/ARITH.2011.32","DOIUrl":"https://doi.org/10.1109/ARITH.2011.32","url":null,"abstract":"User requirements for signal processing have increased in line with, or greater than, the increase in FPGA resources and capability. Many current signal processing algorithms require floating point, especially for military applications such as radar. Also, the increasing system complexity of these designs necessitate increased designer productivity, and floating point allows an easier implementation of the system model than the fixed point arithmetic that FPGA devices have been traditionally architected for. This article will review devices and methods for achieving consistent high performance system implementations in floating point. Single device designs at over 200 GFLOPs at the 40nm node, and approaching 1 Teraflop at 28nm will be described.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122410238","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 1
Automatic Generation of Code for the Evaluation of Constant Expressions at Any Precision with a Guaranteed Error Bound 具有保证误差范围的任意精度常数表达式求值代码的自动生成
2011 IEEE 20th Symposium on Computer Arithmetic Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.38
S. Chevillard
{"title":"Automatic Generation of Code for the Evaluation of Constant Expressions at Any Precision with a Guaranteed Error Bound","authors":"S. Chevillard","doi":"10.1109/ARITH.2011.38","DOIUrl":"https://doi.org/10.1109/ARITH.2011.38","url":null,"abstract":"The evaluation of special functions often involves the evaluation of numerical constants. When the precision of the evaluation is known in advance (e.g., when developing libms) these constants are simply precomputed once and for all. In contrast, when the precision is dynamically chosen by the user (e.g., in multiple precision libraries), the constants must be evaluated on the fly at the required precision and with a rigorous error bound. Often, such constants are easily obtained by means of formulas involving simple numbers and functions. In principle, it is not a difficult work to write multiple precision code for evaluating such formulas with a rigorous round off analysis: one only has to study how round off errors propagate through sub expressions. However, this work is painful and error-prone and it is difficult for a human being to be perfectly rigorous in this process. Moreover, the task quickly becomes impractical when the size of the formula grows. In this article, we present an algorithm that takes as input a constant formula and that automatically produces code for evaluating it in arbitrary precision with a rigorous error bound. It has been implemented in the Solly a free software tool and its behavior is illustrated on several examples.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130159002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Accelerating Large-Scale HPC Applications Using FPGAs 使用fpga加速大规模HPC应用
2011 IEEE 20th Symposium on Computer Arithmetic Pub Date : 2011-07-25 DOI: 10.1109/ARITH.2011.34
R. Dimond, S. Racanière, O. Pell
{"title":"Accelerating Large-Scale HPC Applications Using FPGAs","authors":"R. Dimond, S. Racanière, O. Pell","doi":"10.1109/ARITH.2011.34","DOIUrl":"https://doi.org/10.1109/ARITH.2011.34","url":null,"abstract":"Field Programmable Gate Arrays (FPGAs) are conventionally considered as 'glue-logic'. However, modern FPGAs are extremely competitive compared to state-of-the-art CPUs for commercial HPC workloads, such as those found in Oil and Gas and Finance. For example, an FPGA accelerated system can be 31-37 times faster than an equivalently sized conventional machine, and consume 1/39 of the power. The key to achieving the best performance in FPGA accelerators, while maintaining correctness, is optimization of arithmetic units and data types to suit the range/precision at each point in the computation. The flexibility of the FPGA to implement non-standard arithmetic, combined with a data-flow programming model that instantiates a separate unit for each arithmetic operator in the code provides a wide design space. As such, FPGA computing offers significant opportunity for arithmetic research into 'large scale' HPC applications, where there is an opportunity to move away from standard IEEE formats, either to improve precision compared to the CPU version or to increase speed.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130761756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 26
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信