{"title":"A Power-Scalable Switch-Based Multi-processor FFT","authors":"B. Mohd, E. Swartzlander","doi":"10.1109/ASAP.2009.18","DOIUrl":"https://doi.org/10.1109/ASAP.2009.18","url":null,"abstract":"This paper examines the architecture, algorithm and implementation of a switch-based multi-processor realization of the fast Fourier transform (FFT). The architecture employs M processing elements (PEs), and provides a speedup of M compared with systems that use a single PE. An algorithm is provided to detect and resolve memory conflicts. A CMOS implementation of a four-PE processor is presented. The design is reconfigurable to compute various FFT sizes. The design power consumption is scalable based on the number of active PEs. The timing, area and power results are discussed.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"34 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116610547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
S. Jin, Dongkyun Kim, T. Nguyen, Bongjin Jun, Daijin Kim, J. Jeon
{"title":"An FPGA-based Parallel Hardware Architecture for Real-Time Face Detection Using a Face Certainty Map","authors":"S. Jin, Dongkyun Kim, T. Nguyen, Bongjin Jun, Daijin Kim, J. Jeon","doi":"10.1109/ASAP.2009.36","DOIUrl":"https://doi.org/10.1109/ASAP.2009.36","url":null,"abstract":"This paper presents an FPGA-based parallel hardware architecture for real-time face detection. An image pyramid with twenty depth levels is generated using the input image. For these scaled-down images, a local binary pattern transform and feature evaluation are performed in parallel by using the proposed block RAM-based window processing architecture. By sharing the feature look-up tables between two corresponding scaled-down images, we can reduce the use of routing resources by half. For prototyping and evaluation purposes, the hardware architecture was integrated into a Virtex-5 FPGA. The experimental result shows around 300 frames per second speed performance for processing standard VGA (640×480×8) images. In addition, the throughput of the implementation can be adjusted in proportion to the frame rate of the camera, by synchronizing each individual module with the pixel sampling clock.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116865070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A System Framework for the Design of Embedded Software Targeting Heterogeneous Multi-core SoCs","authors":"X. Guerin, F. Pétrot","doi":"10.1109/ASAP.2009.9","DOIUrl":"https://doi.org/10.1109/ASAP.2009.9","url":null,"abstract":"Embedded appliances designers rely on Heterogeneous Multi-Core System-on-Chips (HMC-SoC) to provide the computing power required by modern applications. Due to the inherent complexity of this kind of platform, the development of speci¿c system architectures is not considered as an option to provide low-level services to an application. Hence, the software is built either from scratch - when the software’s requirements are not too high - or over a general-purpose operating system, leading to performance and memory usage trade-offs. Our contribution is a component-based system framework that provides high-level system services for embedded software applications with few impacts on the memory usage and ¿nal performances, thanks to strong interfaces that enable the reuse of existing software elements and facilitate the support of multiple hardware platforms. The ef¿ciency of our approach is demonstrated on an existing MC-SoC.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123658889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Area-Efficient LDPC Decoder Architecture and Implementation for CMMB Systems","authors":"Kai Zhang, Xinming Huang, Zhongfeng Wang","doi":"10.1109/ASAP.2009.34","DOIUrl":"https://doi.org/10.1109/ASAP.2009.34","url":null,"abstract":"This paper presents an area-efficient LDPC decoder architecture for the China Multimedia Mobile Broadcasting (CMMB) standard. Several techniques are adopted to reduce memory size, including the min-sum algorithm (MSA), optimal bit-width quantization of the iterative messages and reduced complexity for the interconnect network. The decoder for the rate-1/2 9216-bit code is implemented using the 90nm 1.0V CMOS technology. It achieves the decoding throughput of 48Mbps at 5 iterations when operating at 60MHz and the power dissipation is only 34mW.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"218 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122398002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Discrete Event Simulation of Molecular Dynamics Through Event-Based Decomposition","authors":"M. Herbordt, Md. Ashfaquzzaman Khan, T. Dean","doi":"10.1109/ASAP.2009.39","DOIUrl":"https://doi.org/10.1109/ASAP.2009.39","url":null,"abstract":"Molecular dynamics simulation based on discrete event simulation (DMD) is emerging as an alternative to time-step driven molecular dynamics (MD). DMD uses simplified discretized models, enabling simulations to be advanced by event, with a resulting performance increase of several orders of magnitude. Even so, DMD is compute bound. Moreover, unlike MD, causality issues make DMD difficult to scale. Here we present a microarchitecture-inspired parallel algorithm for DMD: speculative execution enables multithreading, while in-order commitment ensures correctness. Our initial not-yet optimized implementation obtains scalability for a multicore processor when running realistic simulation models.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"90 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128009456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-Performance Hardware Architecture for Spectral Hash Algorithm","authors":"R. Cheung, Ç. Koç, J. Villasenor","doi":"10.1109/ASAP.2009.31","DOIUrl":"https://doi.org/10.1109/ASAP.2009.31","url":null,"abstract":"The Spectral Hash algorithm is one of the Round 1 candidates for the SHA-3 family, and is based on spectral arithmetic over a finite field, involving multidimensional discrete Fourier transformations over a finite field, data dependent permutations, Rubic-type rotations, and affine and nonlinear functions. The underlying mathematical structures and operations pose interesting and challenging tasks for computer architects and hardware designers to create fast, efficient, and compact ASIC and FPGA realizations. In this paper, we present an efficient hardware architecture for the full 512-bit hash computation using the spectral hash algorithm. We have created a pipelined implementation on a Xilinx Virtex-4 XC4VLX200-11 FPGA which yields 100 MHz and occupies 38,328 slices, generating a throughput of 51.2 Gbps. Our fully parallel synthesized implementation shows that the spectral hash algorithm is about 100 times faster than the fastest SHA-1 implementation, while requiring only about 13 times as many logic slices.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126994027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impact of Loop Tiling on the Controller Logic of Acceleration Engines","authors":"H. Dutta, J. Zhai, Frank Hannig, J. Teich","doi":"10.1109/ASAP.2009.21","DOIUrl":"https://doi.org/10.1109/ASAP.2009.21","url":null,"abstract":"High computational effort in modern signal and image processing applications often demands for special purpose accelerators in a system on chip (SoC). New high level synthesis methodologies enable the automated design of such programmable or non-programmable accelerators. Loop tiling is a widely used transformation in such methodologies for dimensioning of such accelerators in order to match inherent massive parallelism of considered algorithms with available functional units and processor elements. Innately, the applications are data-flow dominant and have almost no control flow, but the application of tiling techniques has the disadvantage of a more complex control and communication flow. In this paper, we present a methodology for the automatic generation of the control engines of such accelerators. The controller orchestrates the data transfer and computation. The effect of tiling on area, latency, and power overhead of the controller is studied in detail. It is shown that the controller has a substantial overhead of up to 50% in for different tiling and throughput parameters. The energy-delay product is also used as a metric for identifying optimal accelerator designs.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128680939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
M. Sankaradass, Venkata Jakkula, S. Cadambi, S. Chakradhar, Igor Durdanovic, E. Cosatto, H. Graf
{"title":"A Massively Parallel Coprocessor for Convolutional Neural Networks","authors":"M. Sankaradass, Venkata Jakkula, S. Cadambi, S. Chakradhar, Igor Durdanovic, E. Cosatto, H. Graf","doi":"10.1109/ASAP.2009.25","DOIUrl":"https://doi.org/10.1109/ASAP.2009.25","url":null,"abstract":"We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing sub-sampling and non-linear functions specific to CNNs, implement a “meta-operator” to which a CNN may be compiled to. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low precision data and further increase the effective memory bandwidth by packing multiple words in every memory operation, and leverage the algorithm’s simple data access patterns to use off-chip memory as a scratchpad for intermediate data, critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions to transfer data between the memory and coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1GB. The coprocessor prototype can process at the rate of 3.4 billion multiply accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116306341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Low Power High Performance Radix-4 Approximate Squaring Circuit","authors":"Satyendra R. Datla, M. Thornton, D. Matula","doi":"10.1109/ASAP.2009.35","DOIUrl":"https://doi.org/10.1109/ASAP.2009.35","url":null,"abstract":"An implementation of a radix-4 approximate squaring circuit is described employing a new operand dual recoding technique. Approximate squaring circuits have numerous applications including use in computer graphics, digital radio modules, implementation of division and function approximation in ALU circuits. The theory of operation of the circuit is described including radix-4 operand dual recoding. Our recoding yields non negative partial squares and other features which simplify the design of the approximate squaring circuit. Results of the implementation in terms of delay, power, and area in both 130nm and 90nm technologies are presented and analyzed. The results show the circuit is power, area and performance efficient, yielding reduction factors by three or more when compared to a truncated multiplication approach using state-of-the-art logic synthesis tools. The radix-4 squaring circuit is also shown to be more efficient than a radix-2 state-of-the-art binary squaring circuit.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126039298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Andreas Genser, Christian Bachmann, C. Steger, J. Hulzink, Mladen Berekovic
{"title":"Low-Power ASIP Architecture Exploration and Optimization for Reed-Solomon Processing","authors":"Andreas Genser, Christian Bachmann, C. Steger, J. Hulzink, Mladen Berekovic","doi":"10.1109/ASAP.2009.15","DOIUrl":"https://doi.org/10.1109/ASAP.2009.15","url":null,"abstract":"The advent of the mobile age has heavily changed the requirements of today's communication devices. Data transmission over interference-prone wireless channels requires additional steps of data processing, such as forward error correction, to ensure reliable communication. In this work we present RS(63,55) Reed-Solomon encoding and decoding algorithms according to the IEEE 802.15.4a standard executed on dedicated application-specific processor architectures. Algorithmic as well as architectural modifications to speed up execution and well-known low-power techniques to reduce the power consumption are discussed. The speed-up for our proposed designs compared to a general purpose baseline architecture is up to two orders of magnitude. Power reduction due to clock-gating and guarded evaluation results in a 40% power drop and the energy consumption is decreased up to 60x.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"667 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127093337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}