{"title":"The Arithmetic Operators You Will Never See in a Microprocessor","authors":"F. D. Dinechin","doi":"10.1109/ARITH.2011.33","DOIUrl":"https://doi.org/10.1109/ARITH.2011.33","url":null,"abstract":"It has been shown that FPGAs could outperform high-end microprocessors even on floating-point computations, thanks to massive parallelism. Too often, however, such studies re-implement in the FPGA the operators present in a processor. An FPGA can do much better: it can accomodate hardware operators that would make no economical sense in a general-purpose processor, and it can taylor them just right to the needs of the application. This talk tries to survey this idea systematically, discussing its potential, exhibiting some exotic (but useful) operators developed in the FloPoCo project, and listing some of the challenges ahead.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123566043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Radix-16 Combined Division and Square Root Unit","authors":"A. Nannarelli","doi":"10.1109/ARITH.2011.30","DOIUrl":"https://doi.org/10.1109/ARITH.2011.30","url":null,"abstract":"Division and square root, based on the digit-recurrence algorithm, can be implemented in a combined unit. Several implementations of combined division/square root units have been presented mostly for radices 2 and 4. Here, we present a combined radix-16 unit obtained by overlapping two radix-4 result digit selection functions, as it is normally done for division only units. The latency of the unit is reduced by retiming and low power methods are applied as well. The proposed unit is compared to a radix-4 combined division/square root unit, and to a radix-16 unit, obtained by cascading two radix-4 stages, which is similar to the one implemented in a state-of-the-art processor.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129745834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Arnold, J. Cowles, Vassilis Paliouras, I. Kouretas
{"title":"Towards a Quaternion Complex Logarithmic Number System","authors":"M. Arnold, J. Cowles, Vassilis Paliouras, I. Kouretas","doi":"10.1109/ARITH.2011.14","DOIUrl":"https://doi.org/10.1109/ARITH.2011.14","url":null,"abstract":"The well-known generalization of real to complex arithmetic (two reals) extends further to more obscure quaternion arithmetic (four reals), which has applications in signal processing, aerospace, graphics and virtual reality. Quaternion multiplication implements 3D rotation, but is expensive (usually 16 floating-point multiplications and 12 additions). This paper proposes an alternative quaternion representation using logarithms to reduce multiplication cost. The real Logarithmic Number System (LNS) allows fast and inexpensive multiplication and division in embedded and FPGA-based systems. Recent advances in the Complex LNS (CLNS) [5] have made fast log-polar complex representation affordable. Although the quaternion logarithm function is also well-defined, it is not useful to simplify multiplication (in the same way real and complex logarithms are) because quaternion multiplication is not commutative but quaternion addition is. To overcome this, we propose a novel Quaternion Complex (QCLNS) representation using a pair of CLNS numbers. This representation implements quaternion multiplication using only the theoretical minimum [11], [15] of 8 LNS multipliers (i.e., fixed-point adders) and two CLNS adders. Because CLNS numbers are more compact than ordinary rectangular complex representation, single-precision QCLNS occupies 10.9 percent less memory than conventional quaternion representation. Extrapolating conventional LNS and floating-point synthesis data from Fu et al. [12], QCLNS saves on average 10 percent of FPGA resources for precisions between 13 and 45 bits.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125741370","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Generation of Fast and Certified Code for Polynomial Evaluation","authors":"C. Mouilleron, G. Revy","doi":"10.1109/ARITH.2011.39","DOIUrl":"https://doi.org/10.1109/ARITH.2011.39","url":null,"abstract":"Designing an efficient floating-point implementation of a function based on polynomial evaluation requires being able to find an accurate enough evaluation code, exploiting at most the target architecture features. This article introduces CGPE, a tool dealing with the generation of fast and certified codes for the evaluation of bivariate polynomials. First we discuss the issue underlying the evaluation scheme combinatorics before giving an overview of the CGPE tool. The approach we propose consists in two steps: the generation of evaluation schemes by using some heuristics so as to quickly find some of low latency, and the selection that mainly consists in automatically checking their scheduling on the given target and validating their accuracy. Then, we present on-going development and ideas for possible improvements of the whole process. Finally, we illustrate the use of CGPE on some examples, and show how it allows us to generate fast and certified codes in a few seconds and thus to reduce the development time of libms like FLIP.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"44 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125892811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Short Division of Long Integers","authors":"David Harvey, P. Zimmermann","doi":"10.1109/ARITH.2011.11","DOIUrl":"https://doi.org/10.1109/ARITH.2011.11","url":null,"abstract":"We consider the problem of short division -- i.e., approximate quotient -- of multiple-precision integers. We present ready-to-implement algorithms that yield an approximation of the quotient, with tight and rigorous error bounds. We exhibit speedups of up to 30% with respect to GMP division with remainder, and up to 10% with respect to GMP short division, with room for further improvements. This work enables one to implement fast correctly rounded division routines in multiple-precision software tools.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129337700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Radix-8 Digit-by-Rounding: Achieving High-Performance Reciprocals, Square Roots, and Reciprocal Square Roots","authors":"J. A. Butts, P. T. P. Tang, R. Dror, D. Shaw","doi":"10.1109/ARITH.2011.28","DOIUrl":"https://doi.org/10.1109/ARITH.2011.28","url":null,"abstract":"We describe a high-performance digit-recurrence algorithm for computing exactly rounded reciprocals, square roots, and reciprocal square roots in hardware at a rate of three result bits -- one radix-8 digit -- per recurrence iteration. To achieve a single-cycle recurrence at a short cycle time, we adapted the digit-by-rounding algorithm, which is normally applied at much higher radices, for efficient operation at radix 8. Using this approach avoids in the recurrence step the lookup table required by SRT -- the usual algorithm used for hardware digit recurrences. The increasing access latency of this table, the size of which grows super linearly in the radix, limits high-frequency SRT implementations to radix 4 or lower. We also developed a series of novel optimizations focused on further reducing the critical path through the recurrence. We propose, for example, decreasing data path widths to a point where erroneous results sometimes occur and then correcting these errors off the critical path. We present a specific implementation that computes any of these functions to 31 bits of precision in 13 cycles. Our implementation achieves a cycle time only 11% longer than the best reported SRT design for the same functions, yet delivers results in five fewer cycles. Finally, we show that even at lower radices, a digit-by-rounding design is likely to have a shorter critical path than one using SRT at the same radix.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122996157","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Composite Iterative Algorithm and Architecture for q-th Root Calculation","authors":"Álvaro Vázquez, J. Bruguera","doi":"10.1109/ARITH.2011.16","DOIUrl":"https://doi.org/10.1109/ARITH.2011.16","url":null,"abstract":"An algorithm for the q-th root extraction, q being any integer, is presented in this paper. The algorithm is based on an optimized implementation of X^{1/q} by a sequence of parallel and/or overlapped operations: (1) reciprocal, (2) digit-recurrence logarithm, (3) left-to-right carry-free multiplication and (4) on-line exponential. A detailed error analysis and two architectures are proposed, for low precision q and for higher precision q. The execution time and hardware requirements are estimated for single precision floating-point computations for several radices, this helps to determine which radices result in the most efficient implementations. The architectures proposed improve the features of other architectures for q-th root extraction.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128252969","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
T. Anderson, Duc Bui, S. Moharil, Soujanya Narnur, Mujibur Rahman, A. Lell, Eric Biscondi, A. Shrivastava, P. Dent, Mingjian Yan, Hasan Mahmood
{"title":"A 1.5 Ghz VLIW DSP CPU with Integrated Floating Point and Fixed Point Instructions in 40 nm CMOS","authors":"T. Anderson, Duc Bui, S. Moharil, Soujanya Narnur, Mujibur Rahman, A. Lell, Eric Biscondi, A. Shrivastava, P. Dent, Mingjian Yan, Hasan Mahmood","doi":"10.1109/ARITH.2011.20","DOIUrl":"https://doi.org/10.1109/ARITH.2011.20","url":null,"abstract":"A next generation VLIW DSP Central Processing Unit (CPU) which has an integrated fixed point and floating point Instruction Set Architecture (ISA) is presented. It is designed to meet a 1.5 GHz core clock frequency in a 40nm process with aggressive area and power goals. In this paper, the benchmarking process and benefits of newly defined instructions such as complex matrix multiply is explained. Also, the CPU data path is described in detail, highlighting several novel micro-architecture features. Finally, our design methodology as well as verification methodology to ensure functional correctness utilizing formal equivalent verification is described.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117002340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Degree Toom'n'Half for Balanced and Unbalanced Multiplication","authors":"Marco Bodrato","doi":"10.1109/ARITH.2011.12","DOIUrl":"https://doi.org/10.1109/ARITH.2011.12","url":null,"abstract":"Some hints and tricks to automatically obtain high degree Toom-Cook implementations, i.e. functions for integer or polynomial multiplication with a reduced complexity. The described method generates quite an efficient sequence of operations and the memory footprint is kept low by using a new strategy: mixing evaluation, interpolation and recomposition phases. It is possible to automatise the whole procedure obtaining a general Toom-n function, and to extend the method to polynomials in any characteristic except two.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128857592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Flocq: A Unified Library for Proving Floating-Point Algorithms in Coq","authors":"S. Boldo, G. Melquiond","doi":"10.1109/ARITH.2011.40","DOIUrl":"https://doi.org/10.1109/ARITH.2011.40","url":null,"abstract":"Several formalizations of floating-point arithmetic have been designed for the Coq system, a generic proof assistant. Their different purposes have favored some specific applications: program verification, high-level properties, automation. Based on our experience using and/or developing these libraries, we have built a new system that is meant to encompass the other ones in a unified framework. It offers a multi-radix and multi-precision formalization for various floating- and fixed-point formats. This fresh setting has been the occasion for reevaluating known properties and generalizing them. This paper presents design decisions and examples of theorems from the Flocq system: a library easy to use, suitable for automation yet high-level and generic.","PeriodicalId":272151,"journal":{"name":"2011 IEEE 20th Symposium on Computer Arithmetic","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121551077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}