{"title":"On the design of high-radix on-line division for long precision","authors":"A. Tenca, M. Ercegovac","doi":"10.1109/ARITH.1999.762827","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762827","url":null,"abstract":"We present a design of a high-radix on-line division suitable for long precision computations. The proposed scheme uses a quotient-digit selection function based on the residual rounding and scaling of the operands. The bounds on the number of cycles and the cycle time for radix 2^k and n-bit precision are obtained in terms of full-adder delays. The speedup with respect to radix 2 is greater than 3.3 for k ≥ 6 and n ≥ 64. The cost increases as a function of the radix. For the case r=64 and n=64, the increase in area with respect to r=2 is about 6.6 times plus a 512×10-bit table. The proposed scheme has been designed and verified using VHDL and a 1.2 μm CMOS standard gate technology from MOSIS library.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124475714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 32 bit logarithmic arithmetic unit and its performance compared to floating-point","authors":"J. N. Coleman, E. Chester","doi":"10.1109/ARITH.1999.762839","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762839","url":null,"abstract":"As an alternative to floating-point, several papers have proposed the use of a logarithmic number system, in which a real number is represented as a fixed-point logarithm. Multiplication and division therefore proceed in minimal time with no rounding error. However, the system can only offer an overall advantage if addition and subtraction can be performed with speed and accuracy at least equal to that of floating-point, but these operations require the interpolation of a non-linear function which has hitherto been either time-consuming or inaccurate. We present a procedure by which additions and subtractions can be performed rapidly and accurately, and show that these operations are thereby competitive with their floating-point equivalents. We then show that the average performance of the logarithmic system exceeds floating-point, in terms of both speed and accuracy.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"304 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115830062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Floating-point unit in standard cell design with 116 bit wide dataflow","authors":"Guenter Gerwig, M. Kroener","doi":"10.1109/ARITH.1999.762853","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762853","url":null,"abstract":"The floating point unit of an S/390 CMOS microprocessor is described. It contains a 116 bit fraction data flow for addition and subtraction and a 64 bit-wide multiplier. Besides the register array, there are no other dataflow macros used; it is fully designed with standard cell books and is placed flat with a timing driven placement algorithm. This design method allows more 'irregular' structures than usually found in custom designs. An overview of the floating point unit is given and some interesting design items are shown: a 120 bit-wide true-complement adder with precounting of leading zero digits, a signed multiplier with bit-optimized Wallace tree, intensive forwarding in source equal target cases and the checking method.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122517811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Low-power division: comparison among implementations of radix 4, 8 and 16","authors":"A. Nannarelli, T. Lang","doi":"10.1109/ARITH.1999.762829","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762829","url":null,"abstract":"Although division is less frequent than addition and multiplication, because of its longer latency it dissipates a substantial part of the energy in floating-point units. In this paper we explore the relation between the radix and the energy dissipated. Previous work has been done on radix-4 and radix-8 division. Here we extend this study to a radix-16 scheme with two overlapped radix-4 stages and compare the latency, area, and energy of the three implementations. Results show that by applying the low-power techniques the energy dissipation is reduced by 30% to 40% with respect to the standard implementation. An additional 20% reduction can be obtained using a dual voltage. Moreover, the energy dissipated to complete the division is roughly the same for the three radices. However, the power dissipation, proportional to the average current, increases with the radix. If reducing the energy is the priority, for the same latency radix-16 with dual voltage produces the smallest energy dissipation.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125565612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VLSI costs of arithmetic parallelism: a residue reverse conversion perspective","authors":"M. Bhardwaj, T. Srikanthan, C. Clarke","doi":"10.1109/ARITH.1999.762843","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762843","url":null,"abstract":"This paper reports how VLSI cost metrics (area, delay, power) of residue reverse converters scale with the cardinality and dynamic range of moduli sets. The study uses CMAC reverse converters, reported previously by the authors to be the most efficient known to date in terms of area and delay. In all, 134 reverse converters with dynamic ranges from 32 to 120 bits and set cardinalities ranging from 4 to 20 are actually constructed and analyzed. It is seen that area, delay and power costs are cardinality insensitive once the cardinality exceeds a threshold (usually between five and eight). For cardinalities beyond this threshold, conversion costs are essentially dynamic range dependent. This insensitivity is explained in detail by noting the counterbalancing effects of the various sub-units of a CMAC reverse converter. Since practical implementations of RNS usually employ cardinalities beyond the abovementioned thresholds, the significance of this study is its conclusion that increasing the set cardinality in most implementations will have a marginal, if any, effect on VLSI reverse conversion costs.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134183548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Area×delay (A·T) efficient multiplier based on an intermediate hybrid signed-digit (HSD-1) representation","authors":"Jeng-Jong J. Lue, D. Phatak","doi":"10.1109/ARITH.1999.762847","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762847","url":null,"abstract":"Intermediate Signed Digit (SD) representation can facilitate fast and compact VLSI implementations of partial product accumulation trees. It achieves a reduction ratio of 2:1 at every level and also leads to more regular layouts. Its disadvantage is that the number of bit lines that need to be routed can be high. This can lead to a significant area overhead especially at smaller feature sizes where the wire/interconnect area and delay can be dominant. A Hybrid Signed Digit (HSD) representation lets some of the digits be unsigned bits, thereby reducing the number of bit lines. By arbitrarily varying the positions of and distances between consecutive signed digits, this representation can trade off latency for area and offers a continuum of choices between the two's complement representation on the one hand and fully Signed Digit (FSD or simply SD) representation on the other. We illustrate an A·T (area×delay) efficient multiplier based on the HSD-1 representation which is one of the many possible HSD formats, wherein every alternate digit is signed and the rest are unsigned (ordinary) bits. It is seen that multipliers based on HSD-1 format require more transistors than those based on FSD format. However, they require fewer bit lines to be routed, which substantially reduces the interconnect area; thereby leading to a reduction in the total VLSI area and a lower A·T product. The design reaffirms that the interconnect area can be significant, especially at small feature sizes.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132390589","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Intermediate variable encodings that enable multiplexor-based implementations of two operand addition","authors":"D. Phatak, I. Koren","doi":"10.1109/ARITH.1999.762824","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762824","url":null,"abstract":"In two operand addition, bit-wise intermediate variables such as the \"propagate\" and \"generate\" terms are defined/evaluated first. Basic carry propagation recursion is then expressed in terms of these variables and is \"unrolled\" to obtain a tree structure for fast execution. In CMOS VLSI technology, multiplexors are fast and efficient to implement. Hence, we investigate in this paper all possible two-bit encodings for the intermediate variables and identify the ones that enable multiplexor-based implementations. Some of these encodings enable further simplification of the multiplexor-based realizations. Our analysis also shows that adopting an intermediate signed-digit representation simply amounts to selecting one of the possible encodings. Thus, there is no inherent advantage to the use of intermediate signed-digit representations in a two operand addition. Finally, we extend our analysis to the generalized look-ahead-recursions proposed by R.W. Doran (1988).","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122342696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reduced latency IEEE floating-point standard adder architectures","authors":"A. Beaumont-Smith, N. Burgess, S. Lefrere, C. Lim","doi":"10.1109/ARITH.1999.762826","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762826","url":null,"abstract":"The design and implementation of a double precision floating-point IEEE-754 standard adder is described which uses \"flagged prefix addition\" to merge rounding with the significand addition. The floating-point adder is implemented in 0.5 μm CMOS, measures 1.8 mm², has a 3-cycle latency and implements all rounding modes. A modified version of this floating-point adder can perform accumulation in 2-cycles with a small amount of extra hardware for use in a parallel processor node. This is achieved by feeding back the previous un-normalised but correctly rounded result together with the normalisation distance. A 2-cycle latency floating-point adder architecture with potentially the same cycle time that also employs flagged prefix addition is described. It also incorporates a fast prediction scheme for the true subtraction of significands with an exponent difference of 1, with one less adder.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126061674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Montgomery modular exponentiation on reconfigurable hardware","authors":"Thomas Blum","doi":"10.1109/ARITH.1999.762831","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762831","url":null,"abstract":"It is widely recognized that security issues will play a crucial role in the majority of future computer and communication systems. Central tools for achieving system security are cryptographic algorithms. For performance as well as for physical security reasons, it is often advantageous to realize cryptographic algorithms in hardware. In order to overcome the well-known drawback of reduced flexibility that is associated with traditional ASIC solutions, this contribution proposes arithmetic architectures which are optimized for modern field programmable gate arrays (FPGAs). The proposed architectures perform modular exponentiation with very long integers. This operation is at the heart of many practical public-key algorithms such as RSA and discrete logarithm schemes. We combine the Montgomery modular multiplication algorithm with a new systolic array design, which is capable of processing a variable number of bits per array cell. The designs are flexible, allowing any choice of operand and modulus. Unlike previous approaches, we systematically implement and compare several variants of our new architecture for different bit lengths. We provide absolute area and timing measures for each architecture. The results allow conclusions about the feasibility and time-space trade-offs of our architecture for implementation on Xilinx XC4000 series FPGAs. As a major practical result we show that it is possible to implement modular exponentiation at secure bit lengths on a single commercially available FPGA.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"37 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125733835","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Complex logarithmic number system arithmetic using high-radix redundant CORDIC algorithms","authors":"D. Lewis","doi":"10.1109/ARITH.1999.762845","DOIUrl":"https://doi.org/10.1109/ARITH.1999.762845","url":null,"abstract":"This paper describes the application of high radix redundant CORDIC algorithms to complex logarithmic number system arithmetic. It shows that a CLNS addition can be performed with approximately the same hardware as a high-radix CORDIC operation. A design example comparable to single precision floating point has been designed and simulated.","PeriodicalId":434169,"journal":{"name":"Proceedings 14th IEEE Symposium on Computer Arithmetic (Cat. No.99CB36336)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1999-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132322326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}