{"title":"Low-Cost Duplicate Multiplication","authors":"Michael B. Sullivan, E. Swartzlander","doi":"10.1109/ARITH.2015.29","DOIUrl":"https://doi.org/10.1109/ARITH.2015.29","url":null,"abstract":"Rising levels of integration, decreasing component reliabilities, and the ubiquity of computer systems make error protection a rising concern. Meanwhile, the uncertainty of future fault and error modes motivates the design of strong error detection mechanisms that offer fault-agnostic error protection. Current concurrent hardware mechanisms, however, either offer strong error detection coverage at high cost or restrict their coverage to narrow synthetic error models. This paper investigates the potential for duplication using alternate number systems to lower the costs of duplicated multiplication without sacrificing error coverage. Two examples of such low-cost duplication schemes are described and evaluated, it is shown that specialized carry-save or residue number system checking can be used to increase the efficiency of duplicated multiplication.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"16 1","pages":"2-9"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76020759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reliable Evaluation of the Worst-Case Peak Gain Matrix in Multiple Precision","authors":"Anastasia Volkova, Thibault Hilaire, C. Lauter","doi":"10.1109/ARITH.2015.14","DOIUrl":"https://doi.org/10.1109/ARITH.2015.14","url":null,"abstract":"The worst-case peak gain (WCPG) of a linear filter is an important measure for the implementation of signal processing algorithms. It is used in the error propagation analysis for filters, thus a reliable evaluation with controlled precision is required. The WCPG is computed as an infinite sum and has matrix powers in each summand. We propose a direct formula for the lower bound on truncation order of the infinite sum in dependency of desired truncation error. Several multiprecision methods for complex matrix operations are developed and their error analysis performed. A multiprecision matrix powering method is presented. All methods yield a rigorous solution with an absolute error bounded by an a priori given value. The results are illustrated with numerical examples.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"1 1","pages":"96-103"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75374002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Exact Real Arithmetical Algorithm in Binary Continued Fractions","authors":"P. Kurka","doi":"10.1109/ARITH.2015.20","DOIUrl":"https://doi.org/10.1109/ARITH.2015.20","url":null,"abstract":"The exact real binary arithmetical algorithm is an on-line algorithm which computes the sum, product or ratio of two real numbers to arbitrary precision. The algorithm works in general Moebius number systems which represent real numbers by infinite products of Moebius transformations. We consider a number system of binary continued fractions in which this algorithm is computed faster than in the binary signed system. Moreover, the number system of binary continued fractions circumvents the problem of nonredundancy and slow convergence of continued fractions.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"44 1","pages":"168-175"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76972742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Divide-and-Conquer Multiprecision Integer Division","authors":"William Bruce Hart","doi":"10.1109/ARITH.2015.19","DOIUrl":"https://doi.org/10.1109/ARITH.2015.19","url":null,"abstract":"We present a new divide-and-conquer algorithm for mid-range multiprecision integer division which is typically 20-25% faster than the recent algorithms of Moller and Granlund implemented in the GNU Multi Precision (GMP) library.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"48 1","pages":"90-95"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80827653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-Automatic Floating-Point Implementation of Special Functions","authors":"C. Lauter, M. Mezzarobba","doi":"10.1109/ARITH.2015.12","DOIUrl":"https://doi.org/10.1109/ARITH.2015.12","url":null,"abstract":"This work introduces an approach to the computer-assisted implementation of mathematical functions geared toward special functions such as those occurring in mathematical physics. The general idea is to start with an exact symbolic representation of a function and automate as much as possible of the process of implementing it. In order to deal with a large class of special functions, our symbolic representation is an implicit one: the input is a linear differential equation with polynomial coefficients along with initial values. The output is a C program to evaluate the solution of the equation using domain splitting, argument reduction and polynomial approximations in double-precision arithmetic, in the usual style of mathematical libraries. Our generation method combines symbolic-numeric manipulations of linear ODEs with interval-based tools for the floating-point implementation of \"black-box\" functions. We describe a prototype code generator that can automatically produce implementations on moderately large intervals. Implementations on the whole real line are possible in some cases but require manual tool setup and code integration. Due to this limitation and as some heuristics remain, we refer to our method as \"semi-automatic\" at this stage. Along with other examples, we present an implementation of the Voigt profile with fixed parameters that may be of independent interest.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"93 1","pages":"58-65"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84167115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modulo-(2^n -- 2^q -- 1) Parallel Prefix Addition via Excess-Modulo Encoding of Residues","authors":"Seyed Hamed Fatemi Langroudi, G. Jaberipur","doi":"10.1109/ARITH.2015.9","DOIUrl":"https://doi.org/10.1109/ARITH.2015.9","url":null,"abstract":"The residue number system t = {2<sup>n</sup> - 1, 2<sup>n</sup>, 2<sup>n</sup> + 1} has been extensively studied towards perfection in realization of efficient parallel prefix modular adders, with (3 + 2logn △G latency. Many applications, such as digital signal processing require fast modular operations. However, relying only on t limits the magnitude of n, and accordingly the dynamic range. Therefore, additional mutually prime moduli are required to accommodate for wider dynamic range. On the other hand, speed of modular arithmetic operations for the additional moduli should be as close as possible to those in t. This could be best met by the moduli of the form 2<sup>n</sup> - (2<sup>q</sup> + 1), with 1 ≤ q ≤ n - 2, such as 2<sup>n</sup> - 3, 2<sup>n</sup> - 5. However, the fastest parallel prefix realization of modulo-(2<sup>n</sup> - 2<sup>q</sup> - 1) adders that we have encountered in the relevant literature, claims (7 + 2 log n)△G latency. Motivated by the need to reduce the latter, we propose new designs of such adders with (5 + 2 log n)△G latency without any penalty in area consumption or power dissipation. The proposed modular addition algorithm entails supplementary representation of residues in [0,2<sup>q</sup>], as [2<sup>n</sup> - (2<sup>q</sup> + 1), 2<sup>n</sup> - 1]. This leads to additional performance efficiency similar to the effect of double zero representation in modulo-(2<sup>n</sup> - 1) adders. The aforementioned analytically evaluated speed gain and improvements in other figures of merit are also supported via circuit simulation and synthesis.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"2 1","pages":"121-128"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83098151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Implementation of an Embedded FPGA Floating Point DSP Block","authors":"M. Langhammer, B. Pasca","doi":"10.1109/ARITH.2015.18","DOIUrl":"https://doi.org/10.1109/ARITH.2015.18","url":null,"abstract":"This paper describes the architecture and implementation, from both the standpoint of target applications as well as circuit design, of an FPGA DSP Block that can efficiently support both fixed and single precision (SP) floating-point (FP) arithmetic. Most contemporary FPGAs embed DSP blocks that provide simple multiply-add-based fixed-point arithmetic cores. Current FP arithmetic FPGA solutions make use of these hardened DSP resources, together with embedded memory blocks and soft logic resources, however, larger systems cannot be efficiently implemented due to the routing and soft logic limitations on the devices, resulting in significant area, performance, and power consumption penalties compared to ASIC implementations. In this paper we analyse earlier proposed embedded FP implementations, and show why they are not suitable for a production FPGA. We contrast these against our solution -- a unified DSP Block -- where (a) the SP FP multiplier is overlaid on the fixed point constructs, (b) the SP FP Adder/Subtracter is integrated as a separate unit, and (c) the multiplier and adder can be combined in a way that is both arithmetically useful, but also efficient in terms of FPGA routing density and congestion. In addition, a novel way of seamlessly combining any number of DSP Blocks in a low latency structure will be introduced. We will show that this new approach allows a low cost, low power, and high density FP platform on current production 20nm FPGAs. We also describe a future enhancement of the DSP block that can support subnormal numbers.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"128 1","pages":"26-33"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87912391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The end of numerical error","authors":"J. Gustafson","doi":"10.1109/ARITH.2015.34","DOIUrl":"https://doi.org/10.1109/ARITH.2015.34","url":null,"abstract":"Summary form only given, as follows. The full paper was not made available as part of this conference proceedings. It is time to overthrow a century of methods based on floating point arithmetic. Current technical computing is based on the acceptance of rounding error using numerical representations that were invented in 1914, and acceptance of sampling error using algorithms designed for a time when transistors were very expensive. By sticking to an antiquated storage format (now codified as an IEEE standard) well into the exascale area, we are wasting power, energy, storage, bandwidth, and programmer effort. The pursuit of exascale floating point is ridiculous, since we do not need to be making 10^18 sloppy rounding errors per second; we need instead to get provable, valid results for the first time, by turning the speed of parallel computers into higher quality answers instead of more junk per second. We introduce the 'unum' (universal number), a superset of IEEE Floating Point, that contains extra metadata fields that actually save storage, yet give more accurate answers that do not round, overflow, or underflow. The potential they offer for improved programmer productivity is enormous. They also provide, for the first time, the hope of a numerical standard that guarantees bitwise identical results across different computer architectures. Unum format is the basis for the 'ubox' method, which redefines what is meant by \"high performance\" by measuring performance in terms of the knowledge obtained about the answer and not the operations performed per second. Examples are given for practical application to structural analysis, radiation transfer, the n-body problem, linear and nonlinear systems of equations, and Laplace’s equation. This is a fresh approach to scientific computing that allows proper, rigorous representation of real number sets for the first time.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"35 1","pages":"74"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89332574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A General-Purpose Method for Faithfully Rounded Floating-Point Function Approximation in FPGAs","authors":"David B. Thomas","doi":"10.1109/ARITH.2015.27","DOIUrl":"https://doi.org/10.1109/ARITH.2015.27","url":null,"abstract":"A barrier to wide-spread use of Field Programmable Gate Arrays (FPGAs) has been the complexity of programming, but recent advances in High-Level Synthesis (HLS) have made it possible for non-experts to easily create floating-point numerical accelerators from C-like code. However, HLS users are limited to the set of numerical primitives provided by HLS vendors and designers of floating-point IP cores, and cannot easily implement new fast or accurate numerical primitives. This paper presents a method for automatically creating high-performance pipelined floating-point function approximations, which can be integrated as IP cores into numerical accelerators, whether derived from HLS or traditional design methods. Both input and output are floating-point, but internally the function approximator uses fixed-point polynomial segments, guaranteeing a faithfully rounded output. A robust and automated non-uniform segmentation scheme is used to segment any twice-differentiable input function and produce platform-independent VHDL. The approach is demonstrated across ten functions, which are automatically generated then placed and routed in Xilinx devices. The method provides a 1.1x-3x improvement in area over composite numerical approximations, while providing similar performance and significantly better relative error.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"13 1","pages":"42-49"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88001808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Code Generators for Mathematical Functions","authors":"Nicolas Brunie, F. D. Dinechin, O. Kupriianova, C. Lauter","doi":"10.1109/ARITH.2015.22","DOIUrl":"https://doi.org/10.1109/ARITH.2015.22","url":null,"abstract":"A typical floating-point environment includes support for a small set of about 30 mathematical functions such as exponential, logarithm, trigonometric and hyperbolic functions. These functions are provided by mathematical software libraries (libm), typically in IEEE754 single, double and quad precision. This article suggests to replace this libm paradigm by a more general approach: the on-demand generation of numerical function code, on arbitrary domains and with arbitrary accuracies. First, such code generation opens up the libm function space available to programmers. It may capture a much wider set of functions, and may capture even standard functions on non-standard domains and accuracy/performance points. Second, writing libm code requires fine-tuned instruction selection and scheduling for performance, and sophisticated floating-point techniques for accuracy. Automating this task through code generation improves confidence in the code while enabling better design space exploration, and therefore better time to market, even for the libm functions. This article discusses the new challenges of this paradigm shift, and presents the current state of open-source function code generators available on http://www.metalibm.org/.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"12 1","pages":"66-73"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79130514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}