{"title":"Augmented Arithmetic Operations Proposed for IEEE-754 2018","authors":"Jason Riedy, J. Demmel","doi":"10.1109/ARITH.2018.8464813","DOIUrl":"https://doi.org/10.1109/ARITH.2018.8464813","url":null,"abstract":"Algorithms for extending arithmetic precision through compensated summation or arithmetics like double-double rely on operations commonly called twoSum and twoProd-uct. The current draft of the IEEE 754 standard specifies these operations under the names augmentedAddition and augment-edMultiplication. These operations were included after three decades of experience because of a motivating new use: bitwise reproducible arithmetic. Standardizing the operations provides a hardware acceleration target that can provide at least a 33 % speed improvements in reproducible dot product, placing reproducible dot product almost within a factor of two of common dot product. This paper provides history and motivation for standardizing these operations. We also define the operations, explain the rationale for all the specific choices, and provide parameterized test cases for new boundary behaviors.","PeriodicalId":6576,"journal":{"name":"2018 IEEE 25th Symposium on Computer Arithmetic (ARITH)","volume":"286 1","pages":"45-52"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73257723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Various Ways to Split a Floating-Point Number","authors":"C. Jeannerod, J. Muller, P. Zimmermann","doi":"10.1109/ARITH.2018.8464793","DOIUrl":"https://doi.org/10.1109/ARITH.2018.8464793","url":null,"abstract":"We review several ways to split a floating-point number, that is, to decompose it into the exact sum of two floating-point numbers of smaller precision. All the methods considered here involve only a few IEEE floating-point operations, with rounding to nearest and including possibly the fused multiply -add (FMA). Applications range from the implementation of integer functions such as round and floor to the computation of suitable scaling factors aimed, for example, at avoiding spurious underflows and overflows when implementing functions such as the hypotenuse.","PeriodicalId":6576,"journal":{"name":"2018 IEEE 25th Symposium on Computer Arithmetic (ARITH)","volume":"42 1","pages":"53-60"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"72779520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Variant of the Barrett Algorithm Applied to Quotient Selection","authors":"Niall Emmart, Fangyu Zheng, C. Weems","doi":"10.1109/ARITH.2018.8464771","DOIUrl":"https://doi.org/10.1109/ARITH.2018.8464771","url":null,"abstract":"Quotient Selection (QS) is a key step in the classic $O(n^{2}$) multiple precision division algorithm. On processors with fast hardware division, it is a trivial problem, but on GPUs, division is quite slow. In this paper we investigate the effectiveness of Brent and Zimmermann's variant as well as our own novel variant of Barrett's algorithm. Our new approach is shown to be suitable for low radix (single precision) QS. Three highly optimized implementations, two of the Brent and Zimmerman variant and one based on our new approach, have been developed and we show that each is many times faster than using the division operation built in to the compiler. In addition, our variant is on average 22 % faster than the other two implementations. We also sketch proofs of correctness for all of the implementations and our new algorithm.","PeriodicalId":6576,"journal":{"name":"2018 IEEE 25th Symposium on Computer Arithmetic (ARITH)","volume":"18 1","pages":"138-144"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77231484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High Throughput Polynomial and Rational Function Approximations Evaluator","authors":"N. Brisebarre, G. Constantinides, Milos Ercezovac, Silviu-Ioan Filip, Matei Iştoan, J. Muller","doi":"10.1109/ARITH.2018.8464778","DOIUrl":"https://doi.org/10.1109/ARITH.2018.8464778","url":null,"abstract":"We present an automatic method for the evaluation of functions via polynomial or rational approximations and its hardware implementation, on FPGAs. These approximations are evaluated using Ercegovac's iterative E-method adapted for FPGA implementation. The polynomial and rational function coefficients are optimized such that they satisfy the constraints of the E-method. We present several examples of practical interest; in each case a resource-efficient approximation is proposed and comparisons are made with alternative approaches.","PeriodicalId":6576,"journal":{"name":"2018 IEEE 25th Symposium on Computer Arithmetic (ARITH)","volume":"63 1","pages":"99-106"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84311677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhanced Vector Math Support on the Intel®AVX-512 Architecture","authors":"Cristina S. Anderson, Jingwei Zhang, Marius Cornea","doi":"10.1109/ARITH.2018.8464794","DOIUrl":"https://doi.org/10.1109/ARITH.2018.8464794","url":null,"abstract":"The Intel®AVX-512 architecture adds new capabilities such as masked execution, floating-point exception suppression and static rounding modes, as well as a small set of new instructions for mathematical library support. These new features allow for better compliance with floating-point or language standards (e.g. no spurious floating-point exceptions, and faster or more accurate code for directed rounding modes), as well as simpler, smaller footprint implementations that eliminate branches and special case paths. Performance is also improved, in particular for vector mathematical functions (which benefit from easier processing in the main path, and fast access to small lookup tables). In this paper, we describe the relevant new features and their possible applications to floating-point computation. The code examples include a few compact implementation sequences for some common vector mathematical functions.","PeriodicalId":6576,"journal":{"name":"2018 IEEE 25th Symposium on Computer Arithmetic (ARITH)","volume":"1 1","pages":"120-124"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"77286435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast multiplication of binary polynomials with the forthcoming vectorized VPCLMULQDQ instruction","authors":"Nir Drucker, S. Gueron, V. Krasnov","doi":"10.1109/ARITH.2018.8464777","DOIUrl":"https://doi.org/10.1109/ARITH.2018.8464777","url":null,"abstract":"Polynomial multiplication over binary fields $mathbb{F}_{2^{n}}$ is a common primitive, used for example by current cryptosystems such as AES-GCM (with $n=128)$. It also turns out to be a primitive for other cryptosystems, that are being designed for the Post Quantum era, with values $ngg 128$. Examples from the recent submissions to the NIST Post-Quantum Cryptography project, are BIKE, LEDAKem, and GeMSS, where the performance of the polynomial multiplications, is significant. Therefore, efficient polynomial multiplication over $mathbb{F}_{2^{n}}$, with large $n$, is a significant emerging optimization target. Anticipating future applications, Intel has recently announced that its future architecture (codename “Ice Lake”) will introduce a new vectorized way to use the current VPCLMULQDQ instruction. In this paper, we demonstrate how to use this instruction for accelerating polynomial multiplication. Our analysis shows a prediction for at least 2x speedup for multiplications with polynomials of degree 512 or more.","PeriodicalId":6576,"journal":{"name":"2018 IEEE 25th Symposium on Computer Arithmetic (ARITH)","volume":"71 1","pages":"115-119"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90618571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tunable Floating-Point for Energy Efficient Accelerators","authors":"A. Nannarelli","doi":"10.1109/ARITH.2018.8464797","DOIUrl":"https://doi.org/10.1109/ARITH.2018.8464797","url":null,"abstract":"In this work, we address the design of an on-chip accelerator for Machine Learning and other computation-demanding applications with a Tunable Floating-Point (TFP) precision. The precision can be chosen for a single operation by selecting a specific number of bits for significand and exponent in the floating-point representation. By tuning the precision of a given algorithm to the minimum precision achieving an acceptable target error, we can make the computation more power efficient. We focus on floating-point multiplication, which is the most power demanding arithmetic operation.","PeriodicalId":6576,"journal":{"name":"2018 IEEE 25th Symposium on Computer Arithmetic (ARITH)","volume":"6 1","pages":"29-36"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85581806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Radix-64 Floating-Point Divider","authors":"J. Bruguera","doi":"10.1109/ARITH.2018.8464815","DOIUrl":"https://doi.org/10.1109/ARITH.2018.8464815","url":null,"abstract":"Digit-recurrence division is widely used in actual high-performance microprocessors because it presents a good trade-off in terms of performance, area and power. consumption. In this paper we present a radix-64 divider, providing 6 bits per cycle. To have an affordable implementation, each iteration is composed of three radix-4 iterations; speculation is used between consecutive radix-4 iterations to get a reduced timing. The result is a fast, low-latency floating-point divider, requiring 11, 6, and 4 cycles for double-precision, single-precision and half-precision floating-point division with normalized operands and result. One or two additional cycles are needed in case of subnormal operand(s) or result.","PeriodicalId":6576,"journal":{"name":"2018 IEEE 25th Symposium on Computer Arithmetic (ARITH)","volume":"46 1","pages":"84-91"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88153788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Comeback of Reed Solomon Codes","authors":"Nir Drucker, S. Gueron, V. Krasnov","doi":"10.1109/ARITH.2018.8464690","DOIUrl":"https://doi.org/10.1109/ARITH.2018.8464690","url":null,"abstract":"Distributed storage systems utilize erasure codes to reduce their storage costs while efficiently handling failures. Many of these codes (e. g., Reed-Solomon (RS) codes) rely on Galois Field (GF) arithmetic, which is considered to be fast when the field characteristic is 2. Nevertheless, some developments in the field of erasure codes offer new efficient techniques that require mostly XOR operations, and are thus faster than GF operations. Recently, Intel announced [1] that its future architecture (codename “Ice Lake”) will introduce new set of instructions called Galois Field New Instruction (GF-NI). These instructions allow software flows to perform vector and matrix multiplications over GF (28) on the wide registers that are available on the AVX512 architectures. In this paper, we explain the functionality of these instructions, and demonstrate their usage for some fast computations in GF(28). We also use the Intel® Intelligent Storage Acceleration Library (ISA-L) in order to estimate potential future improvement for erasure codes that are based on RS codes. Our results predict $approx 1.4mathrm{x}$ speedup for vectorized multiplication, and 1.83x speedup for the actual encoding.","PeriodicalId":6576,"journal":{"name":"2018 IEEE 25th Symposium on Computer Arithmetic (ARITH)","volume":"56 1","pages":"125-129"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82556632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Faster Modular Exponentiation Using Double Precision Floating Point Arithmetic on the GPU","authors":"Niall Emmart, Fangyu Zheng, C. Weems","doi":"10.1109/ARITH.2018.8464792","DOIUrl":"https://doi.org/10.1109/ARITH.2018.8464792","url":null,"abstract":"This paper presents a new approach to integer multiple precision (MP) modular exponentiation, using double-precision floating point (DPF) operations, that is suitable for GPU implementation. We show speedups ranging from 20 % to 34 % over the best prior G PU times for sizes corresponding to common RSA cryptographic operations (2048 to 4096 bits). Three techniques are described. First, by adding 2104to the high half of the product, and 252 to the low half, we set the implicit leading 1 in the DPF mantissa so that the full 52 explicit bits are available for each half of the 104-bit products of samples. Second, the DPF values are cast bitwise to 64-bit integers for adding the column sums to get the MP result. Normally the cast would require masking off the exponents, but because they are constant, we can include them in the column sums and correct just once for their total. Third, by initializing the column sums with the appropriate negative value to compensate for the exponent sums, no corrective subtraction is needed. Our implementation on an NVIDIA GTX Titan Black GPU achieves between 132.5K and 161.9K modular exponentiations per second of size 1024 bits, with latencies ranging from 21.7 ms to 17.8 ms, making it practical for online RSA applications. Proportional results are shown for 1536 and 2048 bits. The implementation is so efficient that its maximum sustained performance is actually bounded by the thermal limit of the GPU.","PeriodicalId":6576,"journal":{"name":"2018 IEEE 25th Symposium on Computer Arithmetic (ARITH)","volume":"2013 1","pages":"130-137"},"PeriodicalIF":0.0,"publicationDate":"2018-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82608726","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}