{"title":"An efficient PIM (processor-in-memory) architecture for motion estimation","authors":"Jung-Yup Kang, S. Gupta, Saurabh Shah, J. Gaudiot","doi":"10.1109/ASAP.2003.1212852","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212852","url":null,"abstract":"Motion estimation is the most time consuming stage of MPEG family encodings and it reportedly absorbs up to 90% of the total execution time of MPEG processing. Therefore, we propose a hardware/software co-design paradigm that uses a PIM module to efficiently execute motion estimation operations. We use a PIM module to reduce the memory access penalty caused by a large number of memory accesses. We segment the PIM module into small pieces so that each smaller PIM module can execute the operations in parallel fashion. However, in order to execute the operations in parallel, there are critical overheads that involve replicating a huge amount of data to many of these smaller PIM modules. Not only do these replications require a huge amount of additional memory accesses but also calculations when generating addresses. Therefore, we also present an efficient data distribution mechanism to effectively support parallel executions among these smaller PIM modules. With our paradigm, the host processor can be relieved from computationally-intensive and data-intensive workloads of motion estimation. We observed up to 2034× improvement in reduction of the number of memory accesses and up to 439× performance improvement for the execution of motion estimation operations when using our computing paradigm.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116409541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Color space conversion for MPEG decoding on FPGA-augmented TriMedia processor","authors":"M. Sima, S. Vassiliadis, S. Cotofana, J. V. Eijndhoven","doi":"10.1109/ASAP.2003.1212848","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212848","url":null,"abstract":"A case study on color space conversion (CSC) for MPEG decoding, carried out on the FPGA-augmented TriMedia processor is presented. That is, a transform from Y'CbCr color space to R'G'B' color space is addressed. First, we outline the extension of TriMedia architecture consisting of FPGA-based reconfigurable functional units (RFU) and associated instructions. Then we analyse a CSC (RFU-specific) instruction which can process four pixels per call, and propose a scheme to implement the CSC operation on RFU(s). When mapped on an ACEX EP1K100 FPGA, the proposed CSC exhibits a latency of 10 and a recovery of 2 TriMedia@200 MHz cycles, and occupies 57% of the device. By configuring the CSC facility on the RFU(s) at application load-time, color space conversion can be computed on FPGA-augmented TriMedia with 40% speed-up over the standard TriMedia.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125176592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A cryptographic processor for arbitrary elliptic curves over GF(2^m)","authors":"H. Eberle, N. Gura, S. C. Shantz","doi":"10.1109/ASAP.2003.1212867","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212867","url":null,"abstract":"We describe a cryptographic processor for elliptic curve cryptography (ECC). ECC is evolving as an attractive alternative to other public-key schemes such as RSA by offering the smallest key size and the highest strength per bit. The processor performs point multiplication for elliptic curves over binary polynomial fields GF(2^m). In contrast to other designs that only support one curve at a time, our processor is capable of handling arbitrary curves without requiring reconfiguration. More specifically, it can handle both named curves as standardized by NIST as well as any other generic curves up to a field degree of 255. Efficient support for arbitrary curves is particularly important for the targeted server applications that need to handle requests for secure connections generated by a multitude of heterogeneous client devices. Such requests may specify curves which are infrequently used or not even known at implementation time. Our processor implements 256 bit modular multiplication, division, addition and squaring. The multiplier constitutes the core function as it executes the bulk of the point multiplication algorithm. We present a novel digit-serial modular multiplier that uses a hybrid architecture to perform the reduction operation needed to reduce the multiplication result: hardwired logic is used for fast reduction of named curves and the multiplier circuit is reused for reduction of generic curves. The performance of our FPGA-based prototype, running at a clock frequency of 66.4 MHz, is 6955 point multiplications per second for named curves over GF(2^163) and 3308 point multiplications per second for generic curves over GF(2^163).","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129641962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Physical planning for on-chip multiprocessor networks and switch fabrics","authors":"Terry Tao Ye, G. Micheli","doi":"10.1109/ASAP.2003.1212833","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212833","url":null,"abstract":"On-chip implementation of multiprocessor systems requires the planarization of the interconnect network onto the silicon floorplan. Manual floorplanning approaches will become increasingly more difficult and ineffective as multiprocessor complexity increases. Compared with traditional ASIC architectures, multiprocessors have homogeneous processing elements and regular network topologies. Therefore, traditional ASIC floorplanning methodologies based on macro placement are not effective in this domain. We propose an automated physical planning tool, called REGULAY, which can generate floorplans for different topologies under different design constraints. Compared with traditional floorplanning approaches, REGULAY shows significant advantages in reducing the total interconnect wire-length while preserving the regularity and hierarchy of the network topology.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130655133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reconfigurable computing and electronic nanotechnology","authors":"S. Goldstein, M. Budiu, M. Mishra, Girish Venkataramani","doi":"10.1109/ASAP.2003.1212837","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212837","url":null,"abstract":"We examine the opportunities brought about by recent progress in electronic nanotechnology and describe the methods needed to harness them for building a new computer architecture. In this process we decompose some traditional abstractions, such as the transistor, into fine-grain pieces, such as signal restoration and input-output isolation. We also show how we can forgo the extreme reliability of CMOS circuits for low-cost chemical self-assembly at the expense of large manufacturing defect densities. We discuss advanced testing methods that can be used to recover perfect functionality from unreliable parts. We proceed to show how the molecular switch, the regularity of the circuits created by self-assembly and the high defect densities logically require the use of reconfigurable hardware as a basic building block for hardware design. We then capitalize on the convergence of compilation and hardware synthesis (which takes place when programming reconfigurable hardware) to propose the complete elimination of the instruction-set architecture from the system architecture, and the synthesis of asynchronous dataflow machines directly from high-level programming languages, such as C. We discuss in some detail a scalable compilation system that performs this task.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133811992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Area and time efficient modular multiplication of large integers","authors":"Viktor Bunimov, M. Schimmler","doi":"10.1109/ASAP.2003.1212863","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212863","url":null,"abstract":"A new modular multiplication algorithm and its corresponding architecture is presented. It is optimised with respect to hardware complexity and latency. Based on the dataflow of the well known interleaved modular multiplication the product of two n-bit-integers X and Y modulo M is computed by n iterations of a simple loop. The loop consists of one single carry save addition, a comparison of constant complexity, and a table lookup, where the table contains 6 precomputed values and two constants. By this construction the arithmetical complexity of the modular multiplication is reduced to n additions without carry propagation in total which leads to a speedup of at least two in comparison to all methods previously known. It consists of a first algorithm A2 implementing the new idea of combining carry save addition and constant time comparison. A2 is not optimal with respect to area and time. Its correctness is proven. By use of a small amount of precomputing the loop of A2 can be modified such that the effort within the loop is minimised. This leads to the algorithm A3 and it is verified.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133472903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Complex division with prescaling of operands","authors":"J. Muller","doi":"10.1109/ASAP.2003.1212854","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212854","url":null,"abstract":"We adapt the radix-r digit-recurrence division algorithm to complex division. By prescaling the operands, we make the selection of quotient digits simple. This leads to a simple hardware implementation, and allows correct rounding of complex quotient. To reduce large prescaling tables required for radices greater than 4, we adapt the bipartite-table method to multiple-operand functions.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122615019","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware implementation of an elliptic curve processor over GF(p)","authors":"S. Yalcin, L. Batina, B. Preneel, J. Vandewalle","doi":"10.1109/ASAP.2003.1212866","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212866","url":null,"abstract":"We describe a hardware implementation of an arithmetic processor which is efficient for bit-lengths suitable for both commonly used types of public key cryptography (PKC), i.e., elliptic curve (EC) and RSA cryptosystems. Montgomery modular multiplication in a systolic array architecture is used for modular multiplication. The processor consists of special operational blocks for Montgomery modular multiplication, modular addition/subtraction, EC point doubling/addition, modular multiplicative inversion, EC point multiplier, projective to affine coordinates conversion and Montgomery to normal representation conversion.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"84 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122648298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Iterative methods for logarithmic subtraction","authors":"M. Arnold","doi":"10.1109/ASAP.2003.1212855","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212855","url":null,"abstract":"The logarithmic number system (LNS) offers much better performance (in terms of power, speed and area) than floating point for multiplication, division, powers and roots. Moderate-precision addition (of like signs) in LNS generally can be done with table lookup followed by interpolation, whose implementation can be as, or more, efficient than the equivalent precision floating-point adder. The problem with LNS is the size of the table needed for subtraction. We consider iterative methods for logarithmic subtraction. The basis for the novel methods proposed here is that the subtraction logarithm is the inverse of the addition logarithm. Although the mathematics for this kind of logarithmic subtraction were first described during the time of Gauss, no modern designer has implemented an algorithm, like the one proposed here, which performs a binary search followed by an inverse interpolation. Additionally, we propose a novel initialization step for the binary search, which doubles the speed of the algorithm compared to a naive implementation. Combining the proposed method with other iterative methods may reduce the average execution time further. Synthesis results indicate the proposed methods are feasible for FPGA implementation.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"127 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117352144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using media processors for low-memory AES implementation","authors":"J. Irwin, D. Page","doi":"10.1109/ASAP.2003.1212838","DOIUrl":"https://doi.org/10.1109/ASAP.2003.1212838","url":null,"abstract":"Most performance studies of AES make traditional space versus time tradeoffs by allowing large lookup tables to accelerate operations that would normally be calculated by the processor. However, AES is a versatile algorithm and can also be optimised for low-memory use in constrained environments. We investigate the possibility of getting the best of both worlds - an application specific hardware and software solution that has a low dependency on memory yet still executes fast enough to consider for use in production systems. The resulting software is attractive in high level design since it allows AES to be more easily deployed as a composable element in larger systems and scale better as processor speed increases.","PeriodicalId":261592,"journal":{"name":"Proceedings IEEE International Conference on Application-Specific Systems, Architectures, and Processors. ASAP 2003","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2003-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121643519","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}