2011 IEEE 9th Symposium on Application Specific Processors (SASP): Latest Publications (Page 2)

USHA: Unified software and hardware architecture for video decoding

2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941074

Adarsha Rao, S. Nandy, Hristo Nikolov, E. Deprettere

Abstract: Video decoders used in emerging applications need to be flexible to handle a large variety of video formats and deliver scalable performance to handle wide variations in workloads. In this paper we propose a unified software and hardware architecture for video decoding to achieve scalable performance with flexibility. The light weight processor tiles and the reconfigurable hardware tiles in our architecture enable software and hardware implementations to co-exist, while a programmable interconnect enables dynamic interconnection of the tiles. Our process network oriented compilation flow achieves realization agnostic application partitioning and enables seamless migration across uniprocessor, multi-processor, semi hardware and full hardware implementations of a video decoder. An application quality of service aware scheduler monitors and controls the operation of the entire system. We prove the concept through a prototype of the architecture on an off-the-shelf FPGA. The FPGA prototype shows a scaling in performance from QCIF to 1080p resolutions in four discrete steps. We also demonstrate that the reconfiguration time is short enough to allow migration from one configuration to the other without any frame loss.

Citations: 6

Integrating formal verification and high-level processor pipeline synthesis

2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941073

E. Nurvitadhi, J. Hoe, T. Kam, Shih-Lien Lu

Citations: 1

Frameworks for GPU Accelerators: A comprehensive evaluation using 2D/3D image registration

2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941083

Richard Membarth, Frank Hannig, J. Teich, M. Körner, Wieland Eckert

Citations: 20

A massively parallel implementation of QC-LDPC decoder on GPU

2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941084

Guohui Wang, Michael Wu, Yang Sun, Joseph R. Cavallaro

Citations: 60

FPGA based parallel architecture implementation of Stacked Error Diffusion algorithm

2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941080

R. Venugopal, J. Heath, D. Lau

Abstract: Digital halftoning is a crucial technique used in digital printers to convert a continuous-tone image into a pattern of black and white dots. Halftoning is used because printers have a limited set of inks and cannot reproduce all the color intensities in a continuous-tone image. Error diffusion is a halftoning algorithm that iteratively quantizes pixels in a neighborhood-dependent fashion. This manuscript focuses on the development, design, and Hardware Description Language (HDL) functional and performance simulation validation of a parallel, scalable hardware architecture for high-performance implementation of a high-quality Stacked Error Diffusion algorithm. A CMYK printer using the high-quality error diffusion algorithm would be required to execute error diffusion 16 times per pixel, resulting in a potentially high computational cost. The algorithm, originally described in C, requires significant processing time when implemented on a conventional single Central Processing Unit (CPU) based computer system. Thus, a new scalable, high-performance parallel hardware processor architecture is developed to implement the algorithm, and it is implemented and tested on a single Programmable Logic Device (PLD) based Field Programmable Gate Array (FPGA) chip. The run time of the algorithm decreases significantly on the proposed parallel architecture implemented in FPGA technology compared with execution on a single-CPU system.
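To illustrate what "quantizing pixels in a neighborhood-dependent fashion" means, here is a minimal single-channel sketch in Python. It uses the classic Floyd-Steinberg weights as a stand-in; the paper's Stacked Error Diffusion algorithm and its hardware architecture are not reproduced here, and the function name `floyd_steinberg` and the `threshold` parameter are assumptions made for this example.

```python
def floyd_steinberg(image, threshold=0.5):
    """Halftone a grayscale image (rows of floats in [0, 1]) to 0/1 dots.

    Floyd-Steinberg weights are used for illustration only; this is not
    the Stacked Error Diffusion algorithm described in the paper.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    work = [list(row) for row in image]  # working copy that accumulates error
    out = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            old = work[y][x]
            new = 1 if old >= threshold else 0  # quantize to a dot or no dot
            out[y][x] = new
            err = old - new  # quantization error to push onto neighbors
            if x + 1 < width:
                work[y][x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    work[y + 1][x - 1] += err * 3 / 16
                work[y + 1][x] += err * 5 / 16
                if x + 1 < width:
                    work[y + 1][x + 1] += err * 1 / 16
    return out
```

Each pixel depends on error pushed from its left and upper neighbors, which is the data dependency that makes a straightforward CPU loop sequential and motivates a dedicated parallel architecture.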

Citations: 3

ISIS: An accelerator for Sphinx speech recognition

2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941078

A. Chun, Jenny X. Chang, Zhen Fang, Ravishankar R. Iyer, M. Deisher

Citations: 8

Memory-efficient volume ray tracing on GPU for radiotherapy

2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941076

Bo Zhou, X. Hu, D. Chen

Abstract: Ray tracing within a uniform grid volume is a fundamental process invoked frequently by many radiation dose calculation methods in radiotherapy. Recent advances in graphics processing units (GPUs) have made real-time dose calculation a reachable goal. However, the performance of the known GPU methods for volume ray tracing is bounded by memory throughput, which leads to inefficient use of the GPU's computational capacity. This paper introduces a simple yet effective ray tracing technique that aims to improve the GPU's memory bandwidth utilization when processing a massive number of rays. The idea is to exploit the coherent relationship between the rays and match the ray tracing behavior to the underlying characteristics of the GPU memory system. The proposed method has been evaluated on 4 phantom setups using randomly generated rays. The collapsed-cone convolution/superposition (CCCS) dose calculation method is also implemented with and without the proposed approach to verify the feasibility of the method. Compared with the direct GPU implementation of the popular 3DDDA algorithm, the new method provides a speedup in the range of 1.8–2.7X for the given phantom settings. Major performance factors such as ray origins, phantom sizes, and pyramid sizes are also analyzed. The proposed technique was also shown to lead to a speedup of 1.3–1.6X over the original GPU implementation of the CCCS algorithm.
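For readers unfamiliar with ray tracing in a uniform grid, the following Python sketch shows an incremental (DDA-style) traversal of unit-sized voxels, which is the kind of baseline the abstract compares against. It does not implement the paper's memory-coherent technique. The function name `traverse_grid`, the unit voxel size, and the requirement that the origin lies inside the grid are assumptions made for this example.

```python
import math


def traverse_grid(origin, direction, grid_size):
    """Yield (voxel, length) pairs for a ray crossing a grid of unit voxels.

    Assumes the origin lies inside the grid. The lengths are in units of
    the ray parameter t, so they are distances only if direction is unit
    length. Works for any number of dimensions.
    """
    voxel = [int(math.floor(c)) for c in origin]
    step, t_max, t_delta = [], [], []
    for c, d, v in zip(origin, direction, voxel):
        if d > 0:
            step.append(1)
            t_max.append((v + 1 - c) / d)  # t at the next boundary on this axis
            t_delta.append(1 / d)          # t needed to cross one voxel
        elif d < 0:
            step.append(-1)
            t_max.append((v - c) / d)
            t_delta.append(-1 / d)
        else:
            step.append(0)                 # ray never crosses this axis
            t_max.append(math.inf)
            t_delta.append(math.inf)
    t = 0.0
    while all(0 <= v < n for v, n in zip(voxel, grid_size)):
        axis = t_max.index(min(t_max))     # axis whose boundary is hit first
        t_exit = t_max[axis]
        if t_exit == math.inf:             # zero direction vector: no progress
            break
        yield tuple(voxel), t_exit - t
        t = t_exit
        voxel[axis] += step[axis]
        t_max[axis] += t_delta[axis]
```

A dose calculation would typically accumulate a per-voxel quantity weighted by the returned segment length; every step reads a different voxel, which is why such traversals tend to be limited by memory access rather than arithmetic.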

Citations: 5

A parallel accelerator for semantic search

2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941090

Abhinandan Majumdar, S. Cadambi, S. Chakradhar, H. Graf

Abstract: Semantic text analysis is a technique used in advertisement placement, cognitive databases and search engines. With increasing amounts of data and stringent response-time requirements, improving the underlying implementation of semantic analysis becomes critical. To this end, we look at Supervised Semantic Indexing (SSI), a recently proposed algorithm for semantic analysis. SSI ranks a large number of documents based on their semantic similarity to a text query. For each query, it computes millions of dot products on unstructured data, generates a large intermediate result, and then performs ranking. SSI underperforms on both state-of-the-art multi-cores and GPUs. Its performance scalability on multi-cores is hampered by their limited support for fine-grained data parallelism. GPUs, though they beat multi-cores by running thousands of threads, cannot handle large intermediate data because of their small on-chip memory. Motivated by this, we present an FPGA-based hardware accelerator for semantic analysis. As a key feature, the accelerator combines hundreds of simple processing elements (PEs) with in-memory processing to simultaneously generate and process (consume) the large intermediate data. It also supports "dynamic parallelism", a feature that configures the PEs differently for full utilization of the available processing logic after the FPGA is programmed. Our FPGA prototype is 10–13x faster than a 2.5 GHz quad-core Xeon, and 1.5–5x faster than a 240 core 1.3 GHz Tesla GPU, despite operating at a modest frequency of 125 MHz.
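The score-then-rank pattern described in the abstract can be sketched in a few lines of Python. This is a simplified illustration, not the SSI algorithm or the accelerator's design: it assumes documents and queries are sparse term-weight vectors, and the names `dot` and `rank` are hypothetical.

```python
import heapq


def dot(query, doc):
    """Sparse dot product; both arguments map term id -> weight."""
    if len(doc) < len(query):
        query, doc = doc, query  # iterate over the shorter vector
    return sum(w * doc.get(t, 0.0) for t, w in query.items())


def rank(query, docs, k):
    """Return the k (score, doc_id) pairs with the highest scores."""
    # A generator feeds scores straight into the top-k selection, so the
    # full list of intermediate scores is never stored.
    scores = ((dot(query, vec), doc_id) for doc_id, vec in docs.items())
    return heapq.nlargest(k, scores)


docs = {
    "a": {1: 1.0, 2: 2.0},
    "b": {2: 1.0},
    "c": {3: 5.0},
}
top = rank({1: 1.0, 2: 1.0}, docs, 2)  # [(3.0, 'a'), (1.0, 'b')]
```

Consuming each score as soon as it is produced mirrors, in software, the abstract's point about generating and processing the intermediate data simultaneously instead of materializing it first.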

Citations: 2

3D recursive Gaussian IIR on GPU and FPGAs — A case for accelerating bandwidth-bounded applications

2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941081

J. Cong, Muhuan Huang, Yi Zou

Citations: 11

Scalable object detection accelerators on FPGAs using custom design space exploration

2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941089

Chen-Chun Huang, F. Vahid

Citations: 12