2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors最新文献

筛选
英文 中文
A Power-Scalable Switch-Based Multi-processor FFT 基于功率可扩展开关的多处理器FFT
B. Mohd, E. Swartzlander
{"title":"A Power-Scalable Switch-Based Multi-processor FFT","authors":"B. Mohd, E. Swartzlander","doi":"10.1109/ASAP.2009.18","DOIUrl":"https://doi.org/10.1109/ASAP.2009.18","url":null,"abstract":"This paper examines the architecture, algorithm and implementation of a switch-based multi-processor realization of the fast Fourier transform (FFT). The architecture employs M processing elements (PEs), and provides a speedup of M compared with systems that use a single PE. An algorithm is provided to detect and resolve memory conflicts. A CMOS implementation of a four-PE processor is presented. The design is reconfigurable to compute various FFT sizes. The design power consumption is scalable based on the number of active PEs. The timing, area and power results are discussed.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"34 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116610547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
An FPGA-based Parallel Hardware Architecture for Real-Time Face Detection Using a Face Certainty Map 一种基于fpga的并行硬件结构,用于人脸确定性地图的实时人脸检测
S. Jin, Dongkyun Kim, T. Nguyen, Bongjin Jun, Daijin Kim, J. Jeon
{"title":"An FPGA-based Parallel Hardware Architecture for Real-Time Face Detection Using a Face Certainty Map","authors":"S. Jin, Dongkyun Kim, T. Nguyen, Bongjin Jun, Daijin Kim, J. Jeon","doi":"10.1109/ASAP.2009.36","DOIUrl":"https://doi.org/10.1109/ASAP.2009.36","url":null,"abstract":"This paper presents an FPGA-based parallel hardware architecture for real-time face detection. An image pyramid with twenty depth levels is generated using the input image. For these scaled-down images, a local binary pattern transform and feature evaluation are performed in parallel by using the proposed block RAM-based window processing architecture. By sharing the feature look-up tables between two corresponding scaled-down images, we can reduce the use of routing resources by half. For prototyping and evaluation purposes, the hardware architecture was integrated into a Virtex-5 FPGA. The experimental result shows around 300 frames per second speed performance for processing standard VGA (640×480×8) images. In addition, the throughput of the implementation can be adjusted in proportion to the frame rate of the camera, by synchronizing each individual module with the pixel sampling clock.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116865070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 17
A System Framework for the Design of Embedded Software Targeting Heterogeneous Multi-core SoCs 面向异构多核soc的嵌入式软件设计系统框架
X. Guerin, F. Pétrot
{"title":"A System Framework for the Design of Embedded Software Targeting Heterogeneous Multi-core SoCs","authors":"X. Guerin, F. Pétrot","doi":"10.1109/ASAP.2009.9","DOIUrl":"https://doi.org/10.1109/ASAP.2009.9","url":null,"abstract":"Embedded appliances designers rely on Heterogeneous Multi-Core System-on-Chips (HMC-SoC) to provide the computing power required by modern applications. Due to the inherent complexity of this kind of platform, the development of speci¿c system architectures is not considered as an option to provide low-level services to an application. Hence, the software is built either from scratch - when the software’s requirements are not too high - or over a general-purpose operating system, leading to performance and memory usage trade-offs. Our contribution is a component-based system framework that provides high-level system services for embedded software applications with few impacts on the memory usage and ¿nal performances, thanks to strong interfaces that enable the reuse of existing software elements and facilitate the support of multiple hardware platforms. The ef¿ciency of our approach is demonstrated on an existing MC-SoC.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123658889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 40
An Area-Efficient LDPC Decoder Architecture and Implementation for CMMB Systems 面向CMMB系统的面积高效LDPC解码器结构与实现
Kai Zhang, Xinming Huang, Zhongfeng Wang
{"title":"An Area-Efficient LDPC Decoder Architecture and Implementation for CMMB Systems","authors":"Kai Zhang, Xinming Huang, Zhongfeng Wang","doi":"10.1109/ASAP.2009.34","DOIUrl":"https://doi.org/10.1109/ASAP.2009.34","url":null,"abstract":"This paper presents an area-efficient LDPC decoder architecture for the China Multimedia Mobile Broadcasting (CMMB) standard. Several techniques are adopted to reduce memory size, including the min-sum algorithm (MSA), optimal bit-width quantization of the iterative messages and reduced complexity for the interconnect network. The decoder for the rate-1/2 9216-bit code is implemented using the 90nm 1.0V CMOS technology. It achieves the decoding throughput of 48Mbps at 5 iterations when operating at 60MHz and the power dissipation is only 34mW.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"218 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122398002","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 8
Parallel Discrete Event Simulation of Molecular Dynamics Through Event-Based Decomposition 基于事件分解的分子动力学并行离散事件模拟
M. Herbordt, Md. Ashfaquzzaman Khan, T. Dean
{"title":"Parallel Discrete Event Simulation of Molecular Dynamics Through Event-Based Decomposition","authors":"M. Herbordt, Md. Ashfaquzzaman Khan, T. Dean","doi":"10.1109/ASAP.2009.39","DOIUrl":"https://doi.org/10.1109/ASAP.2009.39","url":null,"abstract":"Molecular dynamics simulation based on discrete event simulation (DMD) is emerging as an alternative to time-step driven molecular dynamics (MD). DMD uses simplified discretized models, enabling simulations to be advanced by event, with a resulting performance increase of several orders of magnitude. Even so, DMD is compute bound. Moreover, unlike MD, causality issues make DMD difficult to scale. Here we present a microarchitecture-inspired parallel algorithm for DMD: speculative execution enables multithreading, while in-order commitment ensures correctness. Our initial not-yet optimized implementation obtains scalability for a multicore processor when running realistic simulation models.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"90 11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128009456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 13
A High-Performance Hardware Architecture for Spectral Hash Algorithm 谱哈希算法的高性能硬件架构
R. Cheung, Ç. Koç, J. Villasenor
{"title":"A High-Performance Hardware Architecture for Spectral Hash Algorithm","authors":"R. Cheung, Ç. Koç, J. Villasenor","doi":"10.1109/ASAP.2009.31","DOIUrl":"https://doi.org/10.1109/ASAP.2009.31","url":null,"abstract":"The Spectral Hash algorithm is one of the Round 1 candidates for the SHA-3 family, and is based on spectral arithmetic over a finite field, involving multidimensional discrete Fourier transformations over a finite field, data dependent permutations, Rubic-type rotations, and affine and nonlinear functions. The underlying mathematical structures and operations pose interesting and challenging tasks for computer architects and hardware designers to create fast, efficient, and compact ASIC and FPGA realizations. In this paper, we present an efficient hardware architecture for the full 512-bit hash computation using the spectral hash algorithm. We have created a pipelined implementation on a Xilinx Virtex-4 XC4VLX200-11 FPGA which yields 100 MHz and occupies 38,328 slices, generating a throughput of 51.2 Gbps. Our fully parallel synthesized implementation shows that the spectral hash algorithm is about 100 times faster than the fastest SHA-1 implementation, while requiring only about 13 times as many logic slices.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126994027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 2
Impact of Loop Tiling on the Controller Logic of Acceleration Engines 循环平铺对加速发动机控制器逻辑的影响
H. Dutta, J. Zhai, Frank Hannig, J. Teich
{"title":"Impact of Loop Tiling on the Controller Logic of Acceleration Engines","authors":"H. Dutta, J. Zhai, Frank Hannig, J. Teich","doi":"10.1109/ASAP.2009.21","DOIUrl":"https://doi.org/10.1109/ASAP.2009.21","url":null,"abstract":"High computational effort in modern signal and image processing applications often demands for special purpose accelerators in a system on chip (SoC). New high level synthesis methodologies enable the automated design of such programmable or non-programmable accelerators. Loop tiling is a widely used transformation in such methodologies for dimensioning of such accelerators in order to match inherent massive parallelism of considered algorithms with available functional units and processor elements. Innately, the applications are data-flow dominant and have almost no control flow, but the application of tiling techniques has the disadvantage of a more complex control and communication flow. In this paper, we present a methodology for the automatic generation of the control engines of such accelerators. The controller orchestrates the data transfer and computation. The effect of tiling on area, latency, and power overhead of the controller is studied in detail. It is shown that the controller has a substantial overhead of up to 50% in for different tiling and throughput parameters. The energy-delay product is also used as a metric for identifying optimal accelerator designs.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128680939","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 5
A Massively Parallel Coprocessor for Convolutional Neural Networks 卷积神经网络的大规模并行协处理器
M. Sankaradass, Venkata Jakkula, S. Cadambi, S. Chakradhar, Igor Durdanovic, E. Cosatto, H. Graf
{"title":"A Massively Parallel Coprocessor for Convolutional Neural Networks","authors":"M. Sankaradass, Venkata Jakkula, S. Cadambi, S. Chakradhar, Igor Durdanovic, E. Cosatto, H. Graf","doi":"10.1109/ASAP.2009.25","DOIUrl":"https://doi.org/10.1109/ASAP.2009.25","url":null,"abstract":"We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing sub-sampling and non-linear functions specific to CNNs, implement a “meta-operator” to which a CNN may be compiled to. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low precision data and further increase the effective memory bandwidth by packing multiple words in every memory operation, and leverage the algorithm’s simple data access patterns to use off-chip memory as a scratchpad for intermediate data, critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions to transfer data between the memory and coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1GB. The coprocessor prototype can process at the rate of 3.4 billion multiply accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116306341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 224
A Low Power High Performance Radix-4 Approximate Squaring Circuit 一种低功耗高性能基数-4近似平方电路
Satyendra R. Datla, M. Thornton, D. Matula
{"title":"A Low Power High Performance Radix-4 Approximate Squaring Circuit","authors":"Satyendra R. Datla, M. Thornton, D. Matula","doi":"10.1109/ASAP.2009.35","DOIUrl":"https://doi.org/10.1109/ASAP.2009.35","url":null,"abstract":"An implementation of a radix-4 approximate squaring circuit is described employing a new operand dual recoding technique. Approximate squaring circuits have numerous applications including use in computer graphics, digital radio modules, implementation of division and function approximation in ALU circuits. The theory of operation of the circuit is described including radix-4 operand dual recoding. Our recoding yields non negative partial squares and other features which simplify the design of the approximate squaring circuit. Results of the implementation in terms of delay, power, and area in both 130nm and 90nm technologies are presented and analyzed. The results show the circuit is power, area and performance efficient, yielding reduction factors by three or more when compared to a truncated multiplication approach using state-of-the-art logic synthesis tools. The radix-4 squaring circuit is also shown to be more efficient than a radix-2 state-of-the-art binary squaring circuit.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126039298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 22
Low-Power ASIP Architecture Exploration and Optimization for Reed-Solomon Processing Reed-Solomon处理的低功耗ASIP架构探索与优化
Andreas Genser, Christian Bachmann, C. Steger, J. Hulzink, Mladen Berekovic
{"title":"Low-Power ASIP Architecture Exploration and Optimization for Reed-Solomon Processing","authors":"Andreas Genser, Christian Bachmann, C. Steger, J. Hulzink, Mladen Berekovic","doi":"10.1109/ASAP.2009.15","DOIUrl":"https://doi.org/10.1109/ASAP.2009.15","url":null,"abstract":"The advent of the mobile age has heavily changed the requirements of today's communication devices. Data transmission over interference-prone wireless channels requires additional steps of data processing, such as forward error correction, to ensure reliable communication. In this work we present RS(63,55) Reed-Solomon encoding and decoding algorithms according to the IEEE 802.15.4a standard executed on dedicated application-specific processor architectures. Algorithmic as well as architectural modifications to speed up execution and well-known low-power techniques to reduce the power consumption are discussed. The speed-up for our proposed designs compared to a general purpose baseline architecture is up to two orders of magnitude. Power reduction due to clock-gating and guarded evaluation results in a 40% power drop and the energy consumption is decreased up to 60x.","PeriodicalId":202421,"journal":{"name":"2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors","volume":"667 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-07-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127093337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 6
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
相关产品
×
本文献相关产品
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信