{"title":"Efficient VLSI implementation of modulo (2/sup n//spl plusmn/1) addition and multiplication","authors":"R. Zimmermann","doi":"10.1109/ARITH.1999.762841","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762841","url":null,"abstract":"New VLSI circuit architectures for addition and multiplication modulo (2/sup n/-1) and (2/sup n/+1) are proposed that allow the implementation of highly efficient combinational and pipelined circuits for modular arithmetic. It is shown that the parallel-prefix adder architecture is well suited to realize fast end-around-carry adders used for modulo addition. Existing modulo multiplier architectures are improved for higher speed and regularity. These allow the use of common multiplier speed-up techniques like Wallace-tree addition and Booth recoding, resulting in the fastest known modulo multipliers. Finally, a high-performance modulo multiplier-adder for the IDEA block cipher is presented. The resulting circuits are compared qualitatively and quantitatively, i.e., in a standard-cell technology, with existing solutions and ordinary integer adders and multipliers.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130555317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Floating point division and square root algorithms and implementation in the AMD-K7/sup TM/ microprocessor","authors":"S. Oberman","doi":"10.1109/ARITH.1999.762835","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762835","url":null,"abstract":"This paper presents the AMD-K7 IEEE 754 and /spl times/87 compliant floating point division and square root algorithms and implementation. The AMD-K7 processor employs an iterative implementation of a series expansion to converge quadratically to the quotient and square root. Highly accurate initial approximations and a high performance shared floating point multiplier assist in achieving low division and square root latencies at high operating frequencies. A novel time-sharing technique allows independent floating point multiplication operations to proceed while division or square root computation is in progress. Exact IEEE 754 rounding for all rounding modes and target precisions has been verified by conventional directed and random testing procedures, along with the formulation of a mechanically-checked formal proof using the ACL2 theorem prover.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"69 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129919457","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Necessary and sufficient conditions for parallel, constant time conversion and addition","authors":"Peter Kornerup","doi":"10.1109/ARITH.1999.762840","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762840","url":null,"abstract":"This note presents necessary and sufficient conditions for parallel and constant time conversions from one digit-set into another, and thus also for constant time addition. In the integer domain it is generally believed that such conversion and addition is possible if the target digit-set is redundant and complete. This is also the case when the digit-set is a contiguous set of integers. However, when this is not the case then such conversion and addition in the integer domain is not possible in general, and when more general rings are considered, the same problem may be present.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122750672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multiplications of floating point expansions","authors":"M. Daumas","doi":"10.1109/ARITH.1999.762851","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762851","url":null,"abstract":"In modern computers, the floating point unit is the part of the processor delivering the highest computing power and getting most attention from the design team. Performance of any multiple precision application will be dramatically enhanced by adequate use of floating point expansions. We present three multiplication algorithms, faster and more integrated than the stepwise algorithm proposed earlier. We have tested these novel algorithms on an application that computes the determinant of a matrix. In the absence of overflow or underflow, the process is error free and possibly more efficient than its integer based counterpart.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124977504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Very-high radix CORDIC vectoring with scalings and selection by rounding","authors":"E. Antelo, T. Lang, J. Bruguera","doi":"10.1109/ARITH.1999.762846","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762846","url":null,"abstract":"A very-high radix algorithm and implementation for circular CORDIC in vectoring mode is presented. As for division, to simplify the selection function, the operands are pre-scaled. However in the CORDIC algorithm the coordinate x varies during the execution so several scalings might be needed; we show that two scalings are sufficient. Moreover, the compensation of the variable scale factor is done by computing the logarithm of the scale factor and performing the compensation by an exponential. Estimations of the delay for 32 bit precision show a speed up of about two with respect to the radix-4 case with redundant addition. This speed up is obtained at the cost of an increase in the hardware complexity, which is moderate for the pipelined implementation.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"492 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129610189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"New algorithms for improved transcendental functions on IA-64","authors":"S. Story, P. T. P. Tang","doi":"10.1109/ARITH.1999.762822","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762822","url":null,"abstract":"The IA-64 architecture provides new opportunities and challenges for implementing an improved set of transcendental functions. Using several novel polynomial-based table-driven techniques, we are able to provide new algorithms for the transcendental functions. Major improvements include an accuracy level of about 0.6 ulps (units in the last place) and forward trigonometric functions that have a period of 2/spl pi/. The accuracy enhancements are achieved at improved speed, yet without an increase in the table size. In this paper, we highlight the key IA-64 architectural features that influenced our designs, and explain the main ideas used in our new algorithms.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"168 12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125987620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The S/390 G5 floating point unit supporting hex and binary architectures","authors":"E. Schwarz, Ronald M. Smith, C. Krygowski","doi":"10.1109/ARITH.1999.762852","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762852","url":null,"abstract":"The first high performance floating point unit to support both IBM 360 hexadecimal based floating point architecture and the IEEE 754 Standard binary floating point architecture is described. The S/390 G5 floating point unit supports the new S/390 architecture which includes hexadecimal based short, long, and extended precision formats and IEEE 754 standard single, double, and quad formats. This floating point unit is part of the microprocessor chip on the S/390 G5 mainframe computer introduced in 1998 and generally available at 500 MHz speeds. The S/390 G5 represents the current state of the art in CISC processor design. The paper describes the S/390 architecture enhancements, the internal format of the FPU, and the modifications to the FPU dataflow.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133437423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A comparison of three rounding algorithms for IEEE floating-point multiplication","authors":"G. Even, P. Seidel","doi":"10.1109/ARITH.1999.762848","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762848","url":null,"abstract":"A novel IEEE compliant floating point rounding algorithm for computing the rounded product from a carry-save representation of the product is presented. The new rounding algorithm is compared with the rounding algorithms of R. Yu and G. Zyner (1995) and of N. Quach et al. (1991). For each rounding algorithm, a logical description and a block diagram is given and the latency is analyzed. We conclude that the new rounding algorithm is the fastest rounding algorithm, provided that an injection (which depends only on the rounding mode and the sign) can be added in during the reduction of the partial products into a carry-save encoded digit string. In double precision the latency of the new rounding algorithm is 12 logic levels compared to 14 logic levels in the algorithm of Quach et al., and 16 logic levels in the algorithm of Yu and Zyner.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132234151","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Arithmetic with signed analog digits","authors":"A. Saed, M. Ahmadi, G. Jullien","doi":"10.1109/ARITH.1999.762838","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762838","url":null,"abstract":"This paper presents mathematical foundations of the Overlap Resolution Number System (ORNS) which is based on signed continuous valued digits (CVDs). ORNS is a redundant number system employing residue arithmetic. In contrast to the implementation of arithmetic by binary or multiple-valued logic circuits, arithmetic operations in this novel number system are performed by analog digit manipulation circuitry. The redundancy in an ensemble of continuous valued digits that comprises a number provides tolerance to implementation imprecisions. Processing with these analog digits is performed by carry-free arithmetic structures with systematic circuit level redundancy.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123661728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A family of adders","authors":"S. Knowles","doi":"10.1109/ARITH.1999.762825","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762825","url":null,"abstract":"Binary carry-propagating addition can be efficiently expressed as a prefix computation. Several examples of adders based on such a formulation have been published, and efficient implementations are numerous. Chief among the known constructions are those of Kogge and Stone and Ladner and Fischer. In this work we show that these are end cases of a large family of addition structures, all of which share the attractive property of minimum logical depth. The intermediate structures allow trade-offs between the amount of internal wiring and the fanout of intermediate nodes, and can thus usually achieve a more attractive combination of speed and area/power cost than either of the known end-cases. Rules for the construction of such adders are given, as are examples of realistic 32b designs implemented in an industrial Ou25 CMOS process.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"193 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121847460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}