{"title":"Hardware synthesis for multi-dimensional time","authors":"A. Guillou, P. Quinton, T. Risset","doi":"10.1109/ASAP.2003.1212828","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212828","url":null,"abstract":"We introduce some basic principles for extending the classical systolic synthesis methodology to multidimensional time. Multidimensional scheduling enables complex algorithms that do not admit linear schedules to be parallelized, but it also requires the use of memories in the architecture. We explain how to obtain compatible allocation and memory functions for VLSI (or SIMD-like code) generation. We also present an original mechanism for controlling a VLSI architecture that has a multidimensional schedule. A structural VHDL code has been derived and synthesized (for implementation on FPGA platforms) using these systematic design principles. These results are preliminary steps to the hardware synthesis for multidimensional time.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121891619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance-improved computation of very large word-length LNS addition/subtraction using signed-digit arithmetic","authors":"Chichyang Chen, Rui-Lin Chen","doi":"10.1109/ASAP.2003.1212857","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212857","url":null,"abstract":"Pipelined computation of very large word-length LNS addition/subtraction requires a significant amount of hardware and long pipeline latency. We propose a base-e exponential algorithm to simplify the exponential computation and to replace half of the pipeline stages by multiplication-and-accumulate operations. By using this approach, the circuit cost of the previously proposed 64 bit pipelined LNS addition/subtraction unit can be reduced by more than fifty percent. We also developed signed-digit (SD) algorithms to further enhance the performance of the LNS computation. From our analysis, the throughput of the 64 bit LNS unit can be increased by a factor of 4.62, and the pipeline latency can be reduced by a factor of seven. Furthermore, this SD approach can still save more than 50% of the table size and 27.6% of the circuit of the previously proposed LNS unit. The proposed approaches and algorithms have been verified by comprehensive simulations on the designed 32 bit SD hardware-reduced LNS unit. We have concluded that the proposed approaches can significantly improve the performance of very large word-length LNS addition/subtraction computation.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128926480","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Arbitrary bit permutations in one or two cycles","authors":"Z. Shi, Xiao Yang, R. Lee","doi":"10.1109/ASAP.2003.1212847","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212847","url":null,"abstract":"Symmetric-key block ciphers encrypt data, providing data confidentiality over the public Internet. For interoperability reasons, it is desirable to support a variety of symmetric-key ciphers efficiently. We show the basic operations performed by a variety of symmetric-key cryptography algorithms. Of these basic operations, only bit permutation is very slow using existing processors, followed by integer multiplication. New instructions have been proposed recently to accelerate bit permutations in general-purpose processors, reducing the instructions needed to achieve an arbitrary n-bit permutation from O(n) to O(log(n)). However, the serial data-dependency between these log(n) permutation instructions prevents them from being executed in fewer than log(n) cycles, even on superscalar processors. Since application specific instruction processors (ASIPs) have fewer constraints on maintaining standard processor datapath and control conventions, can we achieve even faster permutations? We propose six alternative ASIP approaches to achieve arbitrary 64 bit permutations in one or two cycles, using new BFLY and IBFLY instructions. This reduction to one or two cycles is achieved without increasing the cycle time. We compare the latencies of different permutation units in a technology independent way to estimate cycle time impact. We also compare the alternative ASIP architectures and their efficiency in performing arbitrary 64 bit permutations.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125540371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application-specific DSP architecture for fast Fourier transform","authors":"K. L. Heo, Sung M. Cho, J. H. Lee, M. Sunwoo","doi":"10.1109/ASAP.2003.1212860","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212860","url":null,"abstract":"We present ASDSP (application-specific digital signal processor) instructions and their hardware architecture for high-speed FFT. The proposed instructions calculate a butterfly within two cycles. The proposed architecture employs a data processing unit (DPU) supporting the instructions and an FFT address generation unit (FAGU) automatically calculating the butterfly input and output data addresses. The proposed DPU has a smaller area than commercial DSP chips. Moreover, the number of FFT computation cycles is reduced by the proposed FAGU. The architecture has been modeled by the VHDL. We have used the UMC 0.25 µm standard cell library for logic synthesis. Performance comparisons show that the number of execution cycles is reduced over 10% and the size of the DPU decreases about 30% compared with Carmel DSP.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"797 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123006203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An architecture for a radix-4 modular pipeline fast Fourier transform","authors":"A. El-Khashab, E. Swartzlander","doi":"10.1109/ASAP.2003.1212861","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212861","url":null,"abstract":"We present a radix-4 modular pipeline architecture for computing the discrete Fourier transform (DFT). For an N-point DFT, two conventional pipeline √N-point fast Fourier transform (FFT) modules are joined by a specialized center element. The center element contains memories, coefficient ROMs, multipliers, and control logic. Compared with a standard N-point pipeline FFT, the modular FFT significantly reduces the number of delay lines to 2√N. Further, the coefficient storage is concentrated within the center element, thereby reducing the ROM requirement within the pipeline FFT modules. The centralized memory and address generator provide data storage and reordering. The architecture has been analyzed through simulation and compared to the conventional pipeline FFT. The throughput of a standard radix-4 pipeline FFT is maintained with a slightly higher end-to-end latency. A reduction in power is achieved because the modular pipeline exhibits N/2 bit transitions on each clock as compared to y bit transitions in the conventional pipeline.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127817252","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combined multiplication and sum-of-squares units","authors":"M. Schulte, L.P. Marquette, S. Krithivasan, E. G. Walters, C. Glossner","doi":"10.1109/ASAP.2003.1212844","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212844","url":null,"abstract":"Multiplication and squaring are important operations in digital signal processing and multimedia applications. We present designs for units that implement either multiplication, A×B, or sum-of-squares computations, A²+B², based on an input control signal. Compared to conventional parallel multipliers, these units have a modest increase in area and delay, but allow either multiplication or sum-of-squares computations to be performed. Combined multiplication and sum-of-squares units for unsigned and two's complement operands are presented, along with integrated designs that can operate on either unsigned or two's complement operands. The designs can also be extended to work with a third accumulator operand to compute either Z+A×B or Z+A²+B². Synthesis results indicate that a combined multiplication and sum-of-squares unit for 32-bit two's complement operands can be implemented with roughly 15% more area and nearly the same worst case delay as a conventional 32-bit two's complement multiplier.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127891624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Instruction set extension for fast elliptic curve cryptography over binary finite fields GF(2^m)","authors":"J. Großschädl, Guy-Armand Kamendje","doi":"10.1109/ASAP.2003.1212868","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212868","url":null,"abstract":"The performance of elliptic curve (EC) cryptosystems depends essentially on efficient arithmetic in the underlying finite field. Binary finite fields GF(2^m) have the advantage of \"carry-free\" addition. Multiplication, on the other hand, is rather costly since polynomial arithmetic is not supported by general-purpose processors. We propose a combined hardware/software approach to overcome this problem. First, we outline that multiplication of binary polynomials can be easily integrated into a multiplier datapath for integers without significant additional hardware. Then, we present new algorithms for multiple-precision arithmetic in GF(2^m) based on the availability of an instruction for single-precision multiplication of binary polynomials. The proposed hardware/software approach is considerably faster than a \"conventional\" software implementation and well suited for constrained devices like smart cards. Our experimental results show that an enhanced 16 bit RISC processor is able to generate a 191 bit ECDSA signature in less than 650 msec when the core is clocked at 5 MHz.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126846128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A generic tool-set for SoC multiprocessor debugging and synchronization","authors":"Andreas Wieferink, Tim Kogel, A. Nohl, A. Hoffmann","doi":"10.1109/ASAP.2003.1212840","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212840","url":null,"abstract":"Current and future SoC designs will contain an increasing number of programmable units. To be able to tailor and debug these processors in their system context at the highest possible overall simulation speed, we propose a methodology and the necessary tooling for a multiprocessor debugging environment which allows a flexible runtime tradeoff between observability and simulation speed. This approach has been applied on a complex SoC case study.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115362766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic instruction set extension and utilization for embedded processors","authors":"A. Peymandoust, L. Pozzi, P. Ienne, G. Micheli","doi":"10.1109/ASAP.2003.1212834","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212834","url":null,"abstract":"There is a growing demand for application-specific embedded processors in system-on-a-chip designs. Current tools and design methodologies often require designers to manually specialize the processor based on an application. Moreover, the use of the new complex instructions added to the processor is often left to designers' ingenuity. We present a solution that automatically groups dataflow operations in the application software as potential new complex instructions. The set of possible instructions is then automatically used for code generation combined with high-level arithmetic optimizations using symbolic algebra. Symbolic arithmetic manipulations provide a novel and effective method for instruction selection that is necessary due to the complexity of the automatically identified instructions. We have used our methodology to automatically add new instructions to Tensilica processors for a set of examples. Our results show that our tools improve designers productivity and efficiently specialize an embedded processor for the given application such that the execution time is greatly improved.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115606791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decimal multiplication via carry-save addition","authors":"M. A. Erle, M. Schulte","doi":"10.1109/ASAP.2003.1212858","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212858","url":null,"abstract":"Decimal multiplication is important in many commercial applications including financial analysis, banking, tax calculation, currency conversion, insurance, and accounting. We present two novel designs for fixed-point decimal multiplication that utilize decimal carry-save addition to reduce the critical path delay. First, a multiplier that stores a reduced number of multiplicand multiples and uses decimal carry-save addition in the iterative portion of the design is presented. Then, a second multiplier design is proposed with several notable improvements including fast generation of multiplicand multiples that do not need to be stored, the use of decimal (4:2) compressors, and a simplified decimal carry-propagate addition to produce the final product. When multiplying two n-digit operands to produce a 2n-digit product, the improved multiplier design has a worst-case latency of n+4 cycles and an initiation interval of n+1 cycles. Three data-dependent optimizations, which help reduce the multipliers' average latency, are also described. The multipliers presented can be extended to support decimal floating-point multiplication.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130744305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}