2013 IEEE 21st Symposium on Computer Arithmetic: Latest Papers (Page 2)

FPU Generator for Design Space Exploration

2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.27

Sameh Galal, Ofer Shacham, J. Brunhaver, Jing Pu, A. Vassiliev, M. Horowitz

Abstract: FPUs have been a topic of research for almost a century, leading to thousands of papers and books. Each advance focuses on the virtues of some specific new technique. This paper compares the energy efficiency of both throughput-optimized and latency-sensitive designs, each employing an array of optimization techniques, through a fair "apples to apples" methodology. This comparison required us to build many optimized FP units. We accomplished this by creating a highly parameterized FP generator, hierarchically encompassing lower-level generators for summation trees, Booth encoders, adders, etc. Having constructed this generator, we quickly relearned a number of low-level issues that are critical and are often the most neglected by papers. By exploring cascade and fused multiply-add architectures across a variety of bit widths, summation trees, Booth encoders, pipelining techniques, and pipe depths, we found that for most throughput-based designs, a Booth-3 fused multiply-add architecture with a Wallace combining tree is optimal. For latency designs, we found that Booth-2 cascade multiply-add architectures are better. As we describe in the paper, Wallace is not always the optimal combining network due to wire delay and track count, and the precise way the CSAs are connected in the tree can make a larger difference than the type of tree used.
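The Booth encoders mentioned in the abstract can be illustrated with a small software model. This is a minimal sketch, assuming that "Booth-2" denotes radix-4 recoding (two multiplier bits retired per partial product); the function names `booth2_digits` and `booth2_multiply` are hypothetical and are not taken from the paper's generator.

```python
def booth2_digits(multiplier, bits):
    """Radix-4 Booth recoding of a two's-complement multiplier.

    `bits` is the multiplier width and must be even. Returns digits in
    {-2, -1, 0, 1, 2}, least significant digit first.
    """
    if bits <= 0 or bits % 2:
        raise ValueError("bits must be a positive even number")
    y = multiplier & ((1 << bits) - 1)  # two's-complement bit pattern
    digits = []
    prev = 0  # implicit bit to the right of the least significant bit
    for i in range(0, bits, 2):
        b0 = (y >> i) & 1
        b1 = (y >> (i + 1)) & 1
        # Each overlapping 3-bit group (b1, b0, prev) selects one digit.
        digits.append(b0 + prev - 2 * b1)
        prev = b1
    return digits


def booth2_multiply(a, multiplier, bits):
    """Multiply by summing one partial product per Booth digit."""
    return sum(d * a * 4**k
               for k, d in enumerate(booth2_digits(multiplier, bits)))
```

For example, `booth2_digits(7, 4)` gives `[-1, 2]`, since -1 + 2·4 = 7, so a 4-bit multiplier needs two partial products instead of four. Radix-8 recoding follows the same idea with wider groups and digits from -4 to 4, at the cost of needing the 3× multiple of the multiplicand.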

Citations: 37

Another Look at Inversions over Binary Fields

2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.25

V. Dimitrov, K. Järvinen

Citations: 20

Fault Detection in RNS Montgomery Modular Multiplication

2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.31

J. Bajard, J. Eynard, F. Gandino

Citations: 23

On-the-Fly Multi-base Recoding for ECC Scalar Multiplication without Pre-computations

2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2013-04-07 DOI: 10.1109/arith.2013.17

Thomas Chabrier, A. Tisserand

Citations: 21

A Formally-Verified C Compiler Supporting Floating-Point Arithmetic

2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.30

S. Boldo, Jacques-Henri Jourdan, X. Leroy, G. Melquiond

Abstract: Floating-point arithmetic is known to be tricky: roundings, formats, exceptional values. The IEEE-754 standard was a push towards straightening the field and made formal reasoning about floating-point computations easier and flourishing. Unfortunately, this is not sufficient to guarantee the final result of a program, as several other actors are involved: programming language, compiler, architecture. The CompCert formally-verified compiler provides a solution to this problem: this compiler comes with a mathematical specification of the semantics of its source language (a large subset of ISO C90) and target platforms (ARM, PowerPC, x86-SSE2), and with a proof that compilation preserves semantics. In this paper, we report on our recent success in formally specifying and proving correct CompCert's compilation of floating-point arithmetic. Since CompCert is verified using the Coq proof assistant, this effort required a suitable Coq formalization of the IEEE-754 standard; we extended the Flocq library for this purpose. As a result, we obtain the first formally verified compiler that provably preserves the semantics of floating-point programs.

Citations: 31

A Non-Linear/Linear Instruction Set Extension for Lightweight Ciphers

2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.36

Susanne Engels, E. Kavun, C. Paar, Tolga Yalçin, Hristina Mihajloska

Abstract: Modern cryptography today is substantially involved with securing lightweight (and pervasive) devices. For this purpose, several lightweight cryptographic algorithms have already been proposed. Up to now, the literature has focused on hardware efficiency, while lightweight with respect to software has barely been addressed. However, a large percentage of lightweight ciphers will be implemented on embedded CPUs without support for cryptographic operations. In parallel, many lightweight ciphers are based on operations which are hardware-friendly but quite costly in software. For instance, bit permutations that accrue essentially no costs in hardware require a non-trivial number of CPU cycles and/or lookup tables in software. Similarly, S-boxes often require relatively large lookup tables in software. In this work, we try to address the open question of efficient cipher implementations on small CPUs by introducing a non-linear/linear instruction set extension, which we refer to as NLU, capable of implementing non-linear operations expressed in their algebraic normal form (ANF) and linear operations expressed in binary "matrix multiply-and-add" form. The proposed NLU is targeted at embedded microcontrollers and is therefore 8 bits wide. However, its modular architecture allows it to be used in 16-, 32-, 64-, and even 4-bit CPUs. We furthermore present examples of the use of NLU in the implementation of standard cryptographic algorithms in order to demonstrate its coding advantage.
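The two operation classes named in the abstract can be modeled in software as follows. This is a minimal sketch, not the NLU instruction encoding: the function names `eval_anf` and `matrix_multiply_add`, the bit-mask representation, and the example inputs are assumptions made for illustration.

```python
def eval_anf(monomials, x):
    """Evaluate a Boolean function given in algebraic normal form (ANF).

    `monomials` is an iterable of bit masks. Each mask selects the input
    bits that are ANDed together in one monomial; mask 0 is the constant 1.
    The function value is the XOR of all monomials, returned as 0 or 1.
    """
    result = 0
    for mask in monomials:
        # The monomial is 1 only if every selected input bit is 1.
        result ^= int((x & mask) == mask)
    return result


def matrix_multiply_add(rows, x, c):
    """Compute y = M*x XOR c over GF(2).

    `rows[i]` is row i of the binary matrix M as a bit mask, so bit i of
    M*x is the parity of (rows[i] & x). `c` is the additive constant.
    """
    y = 0
    for i, row in enumerate(rows):
        parity = bin(row & x).count("1") & 1
        y |= parity << i
    return y ^ c
```

As a usage example, the function f(x) = x0·x1 ⊕ x2 ⊕ 1 is written `eval_anf([0b011, 0b100, 0b000], x)`, and a 4-bit rotate-left-by-one bit permutation is the matrix `[0b1000, 0b0001, 0b0010, 0b0100]` with `c = 0`. The second case shows why a bit permutation, which is free in hardware wiring, fits the linear "multiply-and-add" form.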

Citations: 7

The Unary Arithmetical Algorithm in Bimodular Number Systems

2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.10

P. Kurka, M. Delacourt

Citations: 5

Accurate and Fast Evaluation of Elementary Symmetric Functions

2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.18

Hao Jiang, S. Graillat, R. Barrio

Citations: 8

A Fast Circuit Topology for Finding the Maximum of N k-bit Numbers

2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.35

Bilgiday Yuce, H. F. Ugurdag, Sezer Gören, Günhan Dündar

Abstract: Finding the value and/or address (position) of the maximum element of a set of binary numbers is a fundamental arithmetic operation. Numerous systems, which are used in different application areas, require fast (low-latency) circuits to carry out this operation. We propose a fast circuit topology called Array-Based maximum finder (AB) to determine both value and address of the maximum element within an n-element set of k-bit binary numbers. AB is based on carrying out all of the required comparisons in parallel and then simultaneously computing the address as well as the value of the maximum element. This approach ends up with only one comparator on the critical path, followed by some selection logic. The time complexity of the proposed architecture is O(log₂ n + log₂ k), whereas the area complexity is O(n²k). We developed RTL code generators for AB as well as its competitors. These generators are scalable to any value of n and k. We applied a standard-cell-based iterative synthesis flow that finds the optimum time constraint through binary search. The synthesis results showed that AB is 1.2 to 2.1 times (1.6 times on average) faster than the state of the art.
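The idea of comparing all pairs in parallel and then selecting the winner can be sketched in software. This is a behavioral model under stated assumptions, not the authors' RTL: the function name `array_based_max` is hypothetical, and the tie-breaking rule (lowest index wins) is a choice of this sketch that the abstract does not specify.

```python
def array_based_max(values):
    """Behavioral model of an all-pairs-comparison maximum finder.

    Returns (max_value, index_of_max). Ties go to the lowest index,
    which is an assumption of this sketch.
    """
    n = len(values)
    if n == 0:
        raise ValueError("need at least one value")

    # Stage 1: every pairwise comparison. In hardware these comparators
    # all operate in parallel, so only one sits on the critical path.
    ge = [[values[i] >= values[j] for j in range(n)] for i in range(n)]

    # Stage 2: one-hot winner vector. Element i wins if it is >= every
    # later element and strictly greater than every earlier element.
    one_hot = [
        all(ge[i][j] for j in range(i + 1, n))
        and all(not ge[j][i] for j in range(i))
        for i in range(n)
    ]

    # Stage 3: selection logic turns the one-hot vector into the
    # address and value of the maximum.
    index = one_hot.index(True)
    return values[index], index
```

For example, `array_based_max([3, 9, 2, 9])` returns `(9, 1)`. The number of pairwise comparisons grows quadratically with n, which is consistent with the O(n²k) area stated in the abstract.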

Citations: 15

The Floating-Point Unit of the Jaguar x86 Core

2013 IEEE 21st Symposium on Computer Arithmetic Pub Date : 2013-04-07 DOI: 10.1109/ARITH.2013.24

J. Rupley, J. King, Eric Quinnell, F. Galloway, Ken Patton, P. Seidel, James Dinh, Hai Bui, A. Bhowmik

Abstract: The AMD Jaguar x86 core uses a fully-synthesized, 128-bit native floating-point unit (FPU) built as a co-processor model. The Jaguar FPU supports several x86 ISA extensions, including x87, MMX, SSE1 through SSE4.2, AES, CLMUL, AVX, and F16C instruction sets. The front end of the unit decodes two complex operations per cycle and uses a dedicated renamer (RN), free list (FL), and retire queue (RQ) for in-order dispatch and retire. The FPU issues to the execution units with a dedicated out-of-order, dual-issue scheduler. Execution units source operands from a synthesized physical register file (PRF) and bypass network. The back end of the unit has two execution pipes: the first pipe contains a vector integer ALU, a vector integer MUL unit, and a floating-point adder (FPA); the second pipe contains a vector integer ALU, a store-convert unit, and a floating-point iterative multiplier (FPM). The implementation of the unit focused on low-power design and on vectorized single-precision (SP) performance optimizations. The verification of the unit required complex pseudo-random and formal verification techniques. The Jaguar FPU is built in a 28nm CMOS process.

Citations: 20