{"title":"Low-Cost Duplicate Multiplication","authors":"Michael B. Sullivan, E. Swartzlander","doi":"10.1109/ARITH.2015.29","DOIUrl":"https://doi.org/10.1109/ARITH.2015.29","url":null,"abstract":"Rising levels of integration, decreasing component reliabilities, and the ubiquity of computer systems make error protection a rising concern. Meanwhile, the uncertainty of future fault and error modes motivates the design of strong error detection mechanisms that offer fault-agnostic error protection. Current concurrent hardware mechanisms, however, either offer strong error detection coverage at high cost or restrict their coverage to narrow synthetic error models. This paper investigates the potential for duplication using alternate number systems to lower the costs of duplicated multiplication without sacrificing error coverage. Two examples of such low-cost duplication schemes are described and evaluated; it is shown that specialized carry-save or residue number system checking can be used to increase the efficiency of duplicated multiplication.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"16 1","pages":"2-9"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76020759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reliable Evaluation of the Worst-Case Peak Gain Matrix in Multiple Precision","authors":"Anastasia Volkova, Thibault Hilaire, C. Lauter","doi":"10.1109/ARITH.2015.14","DOIUrl":"https://doi.org/10.1109/ARITH.2015.14","url":null,"abstract":"The worst-case peak gain (WCPG) of a linear filter is an important measure for the implementation of signal processing algorithms. It is used in the error propagation analysis for filters, thus a reliable evaluation with controlled precision is required. The WCPG is computed as an infinite sum and has matrix powers in each summand. We propose a direct formula for the lower bound on truncation order of the infinite sum in dependency of desired truncation error. Several multiprecision methods for complex matrix operations are developed and their error analysis performed. A multiprecision matrix powering method is presented. All methods yield a rigorous solution with an absolute error bounded by an a priori given value. The results are illustrated with numerical examples.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"1 1","pages":"96-103"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75374002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Exact Real Arithmetical Algorithm in Binary Continued Fractions","authors":"P. Kurka","doi":"10.1109/ARITH.2015.20","DOIUrl":"https://doi.org/10.1109/ARITH.2015.20","url":null,"abstract":"The exact real binary arithmetical algorithm is an on-line algorithm which computes the sum, product or ratio of two real numbers to arbitrary precision. The algorithm works in general Moebius number systems which represent real numbers by infinite products of Moebius transformations. We consider a number system of binary continued fractions in which this algorithm is computed faster than in the binary signed system. Moreover, the number system of binary continued fractions circumvents the problem of nonredundancy and slow convergence of continued fractions.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"44 1","pages":"168-175"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"76972742","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Divide-and-Conquer Multiprecision Integer Division","authors":"William Bruce Hart","doi":"10.1109/ARITH.2015.19","DOIUrl":"https://doi.org/10.1109/ARITH.2015.19","url":null,"abstract":"We present a new divide-and-conquer algorithm for mid-range multiprecision integer division which is typically 20-25% faster than the recent algorithms of Moller and Granlund implemented in the GNU Multi Precision (GMP) library.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"48 1","pages":"90-95"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80827653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-Automatic Floating-Point Implementation of Special Functions","authors":"C. Lauter, M. Mezzarobba","doi":"10.1109/ARITH.2015.12","DOIUrl":"https://doi.org/10.1109/ARITH.2015.12","url":null,"abstract":"This work introduces an approach to the computer-assisted implementation of mathematical functions geared toward special functions such as those occurring in mathematical physics. The general idea is to start with an exact symbolic representation of a function and automate as much as possible of the process of implementing it. In order to deal with a large class of special functions, our symbolic representation is an implicit one: the input is a linear differential equation with polynomial coefficients along with initial values. The output is a C program to evaluate the solution of the equation using domain splitting, argument reduction and polynomial approximations in double-precision arithmetic, in the usual style of mathematical libraries. Our generation method combines symbolic-numeric manipulations of linear ODEs with interval-based tools for the floating-point implementation of \"black-box\" functions. We describe a prototype code generator that can automatically produce implementations on moderately large intervals. Implementations on the whole real line are possible in some cases but require manual tool setup and code integration. Due to this limitation and as some heuristics remain, we refer to our method as \"semi-automatic\" at this stage. 
Along with other examples, we present an implementation of the Voigt profile with fixed parameters that may be of independent interest.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"93 1","pages":"58-65"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84167115","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modulo-(2^n -- 2^q -- 1) Parallel Prefix Addition via Excess-Modulo Encoding of Residues","authors":"Seyed Hamed Fatemi Langroudi, G. Jaberipur","doi":"10.1109/ARITH.2015.9","DOIUrl":"https://doi.org/10.1109/ARITH.2015.9","url":null,"abstract":"The residue number system t = {2<sup>n</sup> - 1, 2<sup>n</sup>, 2<sup>n</sup> + 1} has been extensively studied towards perfection in realization of efficient parallel prefix modular adders, with (3 + 2 log n)△G latency. Many applications, such as digital signal processing, require fast modular operations. However, relying only on t limits the magnitude of n, and accordingly the dynamic range. Therefore, additional mutually prime moduli are required to accommodate for wider dynamic range. On the other hand, speed of modular arithmetic operations for the additional moduli should be as close as possible to those in t. This could be best met by the moduli of the form 2<sup>n</sup> - (2<sup>q</sup> + 1), with 1 ≤ q ≤ n - 2, such as 2<sup>n</sup> - 3, 2<sup>n</sup> - 5. However, the fastest parallel prefix realization of modulo-(2<sup>n</sup> - 2<sup>q</sup> - 1) adders that we have encountered in the relevant literature, claims (7 + 2 log n)△G latency. Motivated by the need to reduce the latter, we propose new designs of such adders with (5 + 2 log n)△G latency without any penalty in area consumption or power dissipation. The proposed modular addition algorithm entails supplementary representation of residues in [0,2<sup>q</sup>], as [2<sup>n</sup> - (2<sup>q</sup> + 1), 2<sup>n</sup> - 1]. This leads to additional performance efficiency similar to the effect of double zero representation in modulo-(2<sup>n</sup> - 1) adders. 
The aforementioned analytically evaluated speed gain and improvements in other figures of merit are also supported via circuit simulation and synthesis.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"2 1","pages":"121-128"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83098151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Design and Implementation of an Embedded FPGA Floating Point DSP Block","authors":"M. Langhammer, B. Pasca","doi":"10.1109/ARITH.2015.18","DOIUrl":"https://doi.org/10.1109/ARITH.2015.18","url":null,"abstract":"This paper describes the architecture and implementation, from both the standpoint of target applications as well as circuit design, of an FPGA DSP Block that can efficiently support both fixed and single precision (SP) floating-point (FP) arithmetic. Most contemporary FPGAs embed DSP blocks that provide simple multiply-add-based fixed-point arithmetic cores. Current FP arithmetic FPGA solutions make use of these hardened DSP resources, together with embedded memory blocks and soft logic resources, however, larger systems cannot be efficiently implemented due to the routing and soft logic limitations on the devices, resulting in significant area, performance, and power consumption penalties compared to ASIC implementations. In this paper we analyse earlier proposed embedded FP implementations, and show why they are not suitable for a production FPGA. We contrast these against our solution -- a unified DSP Block -- where (a) the SP FP multiplier is overlaid on the fixed point constructs, (b) the SP FP Adder/Subtracter is integrated as a separate unit, and (c) the multiplier and adder can be combined in a way that is both arithmetically useful, but also efficient in terms of FPGA routing density and congestion. In addition, a novel way of seamlessly combining any number of DSP Blocks in a low latency structure will be introduced. We will show that this new approach allows a low cost, low power, and high density FP platform on current production 20nm FPGAs. 
We also describe a future enhancement of the DSP block that can support subnormal numbers.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"128 1","pages":"26-33"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87912391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The end of numerical error","authors":"J. Gustafson","doi":"10.1109/ARITH.2015.34","DOIUrl":"https://doi.org/10.1109/ARITH.2015.34","url":null,"abstract":"Summary form only given, as follows. The full paper was not made available as part of this conference proceedings. It is time to overthrow a century of methods based on floating point arithmetic. Current technical computing is based on the acceptance of rounding error using numerical representations that were invented in 1914, and acceptance of sampling error using algorithms designed for a time when transistors were very expensive. By sticking to an antiquated storage format (now codified as an IEEE standard) well into the exascale area, we are wasting power, energy, storage, bandwidth, and programmer effort. The pursuit of exascale floating point is ridiculous, since we do not need to be making 10^18 sloppy rounding errors per second; we need instead to get provable, valid results for the first time, by turning the speed of parallel computers into higher quality answers instead of more junk per second. We introduce the 'unum' (universal number), a superset of IEEE Floating Point, that contains extra metadata fields that actually save storage, yet give more accurate answers that do not round, overflow, or underflow. The potential they offer for improved programmer productivity is enormous. They also provide, for the first time, the hope of a numerical standard that guarantees bitwise identical results across different computer architectures. Unum format is the basis for the 'ubox' method, which redefines what is meant by \"high performance\" by measuring performance in terms of the knowledge obtained about the answer and not the operations performed per second. Examples are given for practical application to structural analysis, radiation transfer, the n-body problem, linear and nonlinear systems of equations, and Laplace’s equation. 
This is a fresh approach to scientific computing that allows proper, rigorous representation of real number sets for the first time.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"35 1","pages":"74"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89332574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reproducible Tall-Skinny QR","authors":"Hong Diep Nguyen, J. Demmel","doi":"10.1109/ARITH.2015.28","DOIUrl":"https://doi.org/10.1109/ARITH.2015.28","url":null,"abstract":"Reproducibility is the ability to obtain bitwise identical results from different runs of the same program on the same input data, regardless of the available computing resources, or how they are scheduled. Recently, techniques have been proposed to attain reproducibility for BLAS operations, all of which rely on reproducibly computing the floating-point sum and dot product. Nonetheless, a reproducible BLAS library does not automatically translate into a reproducible higher-level linear algebra library, especially when communication is optimized. For instance, for the QR factorization, conventional algorithms such as Householder transformation or Gram-Schmidt process can be used to reproducibly factorize a floating-point matrix by fixing the high-level order of computation, for example column-by-column from left to right, and by using reproducible versions of level-1 BLAS operations such as dot product and 2-norm. In a massively parallel environment, those algorithms have high communication cost due to the need for synchronization after each step. The Tall-Skinny QR algorithm obtains much better performance in massively parallel environments by reducing the number of messages by a factor of n to O(log(P)) where P is the processor count, by reducing the number of reduction operations to O(1). Those reduction operations however are highly dependent on the network topology, in particular the number of computing nodes, and therefore are difficult to implement reproducibly and with reasonable performance. 
In this paper we present a new technique to reproducibly compute a QR factorization for a tall skinny matrix, which is based on the Cholesky QR algorithm to attain reproducibility as well as to improve communication cost, and the iterative refinement technique to guarantee the accuracy of the computed results. Our technique exhibits strong scalability in massively parallel environments, and at the same time can provide results of almost the same accuracy as the conventional Householder QR algorithm unless the matrix is extremely badly conditioned, in which case a warning can be given. Initial experimental results in Matlab show that for not too ill-conditioned matrices whose condition number is smaller than sqrt(1/e) where e is the machine epsilon, our technique runs less than 4 times slower than the built-in Matlab qr() function, and always computes numerically stable results in terms of column-wise relative error.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"31 1","pages":"152-159"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85481966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Code Generators for Mathematical Functions","authors":"Nicolas Brunie, F. D. Dinechin, O. Kupriianova, C. Lauter","doi":"10.1109/ARITH.2015.22","DOIUrl":"https://doi.org/10.1109/ARITH.2015.22","url":null,"abstract":"A typical floating-point environment includes support for a small set of about 30 mathematical functions such as exponential, logarithm, trigonometric and hyperbolic functions. These functions are provided by mathematical software libraries (libm), typically in IEEE754 single, double and quad precision. This article suggests to replace this libm paradigm by a more general approach: the on-demand generation of numerical function code, on arbitrary domains and with arbitrary accuracies. First, such code generation opens up the libm function space available to programmers. It may capture a much wider set of functions, and may capture even standard functions on non-standard domains and accuracy/performance points. Second, writing libm code requires fine-tuned instruction selection and scheduling for performance, and sophisticated floating-point techniques for accuracy. Automating this task through code generation improves confidence in the code while enabling better design space exploration, and therefore better time to market, even for the libm functions. 
This article discusses the new challenges of this paradigm shift, and presents the current state of open-source function code generators available on http://www.metalibm.org/.","PeriodicalId":6526,"journal":{"name":"2015 IEEE 22nd Symposium on Computer Arithmetic","volume":"12 1","pages":"66-73"},"PeriodicalIF":0.0,"publicationDate":"2015-06-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79130514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}