2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors: Latest Papers (Page 2)

A Power-Scalable Switch-Based Multi-processor FFT

2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2009-07-07 DOI: 10.1109/ASAP.2009.18

B. Mohd, E. Swartzlander

Citations: 2

An FPGA-based Parallel Hardware Architecture for Real-Time Face Detection Using a Face Certainty Map

2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2009-07-07 DOI: 10.1109/ASAP.2009.36

S. Jin, Dongkyun Kim, T. Nguyen, Bongjin Jun, Daijin Kim, J. Jeon

Citations: 17

A System Framework for the Design of Embedded Software Targeting Heterogeneous Multi-core SoCs

2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2009-07-07 DOI: 10.1109/ASAP.2009.9

X. Guerin, F. Pétrot

Citations: 40

An Area-Efficient LDPC Decoder Architecture and Implementation for CMMB Systems

2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2009-07-07 DOI: 10.1109/ASAP.2009.34

Kai Zhang, Xinming Huang, Zhongfeng Wang

Citations: 8

Parallel Discrete Event Simulation of Molecular Dynamics Through Event-Based Decomposition

2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2009-07-07 DOI: 10.1109/ASAP.2009.39

M. Herbordt, Md. Ashfaquzzaman Khan, T. Dean

Citations: 13

A High-Performance Hardware Architecture for Spectral Hash Algorithm

2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2009-07-07 DOI: 10.1109/ASAP.2009.31

R. Cheung, Ç. Koç, J. Villasenor

Citations: 2

Impact of Loop Tiling on the Controller Logic of Acceleration Engines

2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2009-07-07 DOI: 10.1109/ASAP.2009.21

H. Dutta, J. Zhai, Frank Hannig, J. Teich

Abstract: High computational effort in modern signal and image processing applications often demands special-purpose accelerators in a system on chip (SoC). New high-level synthesis methodologies enable the automated design of such programmable or non-programmable accelerators. Loop tiling is a widely used transformation in such methodologies for dimensioning accelerators so that the inherent massive parallelism of the considered algorithms matches the available functional units and processor elements. The applications are innately data-flow dominant and have almost no control flow, but applying tiling techniques has the disadvantage of a more complex control and communication flow. In this paper, we present a methodology for the automatic generation of the control engines of such accelerators. The controller orchestrates the data transfer and computation. The effect of tiling on the area, latency, and power overhead of the controller is studied in detail. It is shown that the controller has a substantial overhead of up to 50% for different tiling and throughput parameters. The energy-delay product is also used as a metric for identifying optimal accelerator designs.

Citations: 5

A Massively Parallel Coprocessor for Convolutional Neural Networks

2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2009-07-07 DOI: 10.1109/ASAP.2009.25

M. Sankaradass, Venkata Jakkula, S. Cadambi, S. Chakradhar, Igor Durdanovic, E. Cosatto, H. Graf

Abstract: We present a massively parallel coprocessor for accelerating Convolutional Neural Networks (CNNs), a class of important machine learning algorithms. The coprocessor functional units, consisting of parallel 2D convolution primitives and programmable units performing sub-sampling and non-linear functions specific to CNNs, implement a “meta-operator” to which a CNN may be compiled. The coprocessor is serviced by distributed off-chip memory banks with large data bandwidth. As a key feature, we use low-precision data and further increase the effective memory bandwidth by packing multiple words in every memory operation, and we leverage the algorithm’s simple data access patterns to use off-chip memory as a scratchpad for intermediate data, which is critical for CNNs. A CNN is mapped to the coprocessor hardware primitives with instructions to transfer data between the memory and the coprocessor. We have implemented a prototype of the CNN coprocessor on an off-the-shelf PCI FPGA card with a single Xilinx Virtex5 LX330T FPGA and 4 DDR2 memory banks totaling 1 GB. The coprocessor prototype can process at the rate of 3.4 billion multiply accumulates per second (GMACs) for CNN forward propagation, a speed that is 31x faster than a software implementation on a 2.2 GHz AMD Opteron processor. For a complete face recognition application with the CNN on the coprocessor and the rest of the image processing tasks on the host, the prototype is 6-10x faster, depending on the host-coprocessor bandwidth.

Citations: 224

A Low Power High Performance Radix-4 Approximate Squaring Circuit

2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2009-07-07 DOI: 10.1109/ASAP.2009.35

Satyendra R. Datla, M. Thornton, D. Matula

Citations: 22

Low-Power ASIP Architecture Exploration and Optimization for Reed-Solomon Processing

2009 20th IEEE International Conference on Application-specific Systems, Architectures and Processors Pub Date : 2009-07-07 DOI: 10.1109/ASAP.2009.15

Andreas Genser, Christian Bachmann, C. Steger, J. Hulzink, Mladen Berekovic

Citations: 6