2011 IEEE 9th Symposium on Application Specific Processors (SASP): Latest Publications

USHA: Unified software and hardware architecture for video decoding
2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941074
Adarsha Rao, S. Nandy, Hristo Nikolov, E. Deprettere
Abstract: Video decoders used in emerging applications need to be flexible to handle a large variety of video formats and deliver scalable performance to handle wide variations in workloads. In this paper we propose a unified software and hardware architecture for video decoding to achieve scalable performance with flexibility. The light weight processor tiles and the reconfigurable hardware tiles in our architecture enable software and hardware implementations to co-exist, while a programmable interconnect enables dynamic interconnection of the tiles. Our process network oriented compilation flow achieves realization agnostic application partitioning and enables seamless migration across uniprocessor, multi-processor, semi hardware and full hardware implementations of a video decoder. An application quality of service aware scheduler monitors and controls the operation of the entire system. We prove the concept through a prototype of the architecture on an off-the-shelf FPGA. The FPGA prototype shows a scaling in performance from QCIF to 1080p resolutions in four discrete steps. We also demonstrate that the reconfiguration time is short enough to allow migration from one configuration to the other without any frame loss.
Citations: 6
Integrating formal verification and high-level processor pipeline synthesis
2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941073
E. Nurvitadhi, J. Hoe, T. Kam, Shih-Lien Lu
Abstract: When a processor implementation is synthesized from a specification using an automatic framework, this implementation still should be verified against its specification to ensure the automatic framework introduced no error. This paper presents our effort in integrating fully automated formal verification with a high-level processor pipeline synthesis framework. As an integral part of the pipeline synthesis, our framework also emits SMV models for checking the functional equivalence between the output pipelined processor implementation and its input non-pipelined specification. Well known compositional model checking techniques are automatically applied to curtail state explosion during model checking. The paper reports case studies of applying this integrated framework to synthesize and formally verify pipelined RISC and CISC processors.
Citations: 1
Frameworks for GPU Accelerators: A comprehensive evaluation using 2D/3D image registration
2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941083
Richard Membarth, Frank Hannig, J. Teich, M. Körner, Wieland Eckert
Abstract: In the last decade, there has been a dramatic growth in research and development of massively parallel many-core architectures like graphics hardware, both in academia and industry. This also changed the way programs are written in order to leverage the processing power of a multitude of cores on the same hardware. In the beginning, programmers had to use special graphics programming interfaces to express general purpose computations on graphics hardware. Today, several frameworks exist to relieve the programmer from such tasks. In this paper, we present five frameworks for parallelization on GPU Accelerators, namely RapidMind, PGI Accelerator, HMPP Workbench, OpenCL, and CUDA. To evaluate these frameworks, a real world application from medical imaging is investigated: 2D/3D image registration.
Citations: 20
A massively parallel implementation of QC-LDPC decoder on GPU
2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941084
Guohui Wang, Michael Wu, Yang Sun, Joseph R. Cavallaro
Abstract: The graphics processor unit (GPU) is able to provide a low-cost and flexible software-based multi-core architecture for high performance computing. However, it is still very challenging to efficiently map the real-world applications to GPU and fully utilize the computational power of GPU. As a case study, we present a GPU-based implementation of a real-world digital signal processing (DSP) application: low-density parity-check (LDPC) decoder. The paper shows the efforts we made to map the algorithm onto the massively parallel architecture of GPU and fully utilize GPU's computational resources to significantly boost the performance. Moreover, several efficient data structures have been proposed to reduce the memory access latency and the memory bandwidth requirement. Experimental results show that the proposed GPU-based LDPC decoding accelerator can take advantage of the multi-core computational power provided by GPU and achieve high throughput up to 100.3 Mbps.
Citations: 60
FPGA based parallel architecture implementation of Stacked Error Diffusion algorithm
2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941080
R. Venugopal, J. Heath, D. Lau
Abstract: Digital halftoning is a crucial technique used in digital printers to convert a continuous-tone image into a pattern of black and white dots. Halftoning is used since printers have a limited availability of inks and cannot reproduce all the color intensities in a continuous image. Error Diffusion is an algorithm in halftoning that iteratively quantizes pixels in a neighborhood dependent fashion. This manuscript focuses on the development, design and Hardware Description Language (HDL) functional and performance simulation validation of a parallel scalable hardware architecture for high performance implementation of a high quality Stacked Error Diffusion algorithm. A CMYK printer, utilizing the high quality error diffusion algorithm, would be required to execute error diffusion 16 times per pixel, resulting in a potentially high computational cost. The algorithm, originally described in ‘C’, requires a significant processing time when implemented on a conventional single Central Processing Unit (CPU) based computer system. Thus, a new scalable high performance parallel hardware processor architecture is developed to implement the algorithm and is implemented and tested on a single Programmable Logic Device (PLD) based Field Programmable Gate Array (FPGA) chip. There is a significant decrease in the run time of the algorithm when run on the newly proposed parallel architecture implemented in FPGA technology compared to execution on a single CPU based system.
Citations: 3
ISIS: An accelerator for Sphinx speech recognition
2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941078
A. Chun, Jenny X. Chang, Zhen Fang, Ravishankar R. Iyer, M. Deisher
Abstract: The ability to naturally interact with devices is becoming increasingly important. Speech recognition is one well-known solution to provide easy, hands-free user-device interaction. However, speech recognition has significant computation and memory bandwidth requirements, making it challenging to offer at high performance, real-time and ultra-low power for handheld devices. In this paper, we present a speech recognition accelerator called ISIS. We show the overall execution flow of the accelerated speech recognition solution along with optimizations and the key metrics of performance, area and power.
Citations: 8
Memory-efficient volume ray tracing on GPU for radiotherapy
2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941076
Bo Zhou, X. Hu, D. Chen
Abstract: Ray tracing within a uniform grid volume is a fundamental process invoked frequently by many radiation dose calculation methods in radiotherapy. Recent advances of the graphics processing units (GPU) help real-time dose calculation become a reachable goal. However, the performance of the known GPU methods for volume ray tracing is all bounded by the memory-throughput, which leads to inefficient usage of the GPU computational capacity. This paper introduces a simple yet effective ray tracing technique aiming to improve the memory bandwidth utilization of GPU for processing a massive number of rays. The idea is to exploit the coherent relationship between the rays and match the ray tracing behavior with the underlying characteristics of the GPU memory system. The proposed method has been evaluated on 4 phantom setups using randomly generated rays. The collapsed-cone convolution/superposition (CCCS) dose calculation method is also implemented with/without the proposed approach to verify the feasibility of our method. Compared with the direct GPU implementation of the popular 3DDDA algorithm, the new method provides a speedup in the range of 1.8–2.7X for the given phantom settings. Major performance factors such as ray origins, phantom sizes, and pyramid sizes are also analyzed. The proposed technique was also shown to lead to a speedup of 1.3–1.6X over the original GPU implementation of the CCCS algorithm.
Citations: 5
A parallel accelerator for semantic search
2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941090
Abhinandan Majumdar, S. Cadambi, S. Chakradhar, H. Graf
Abstract: Semantic text analysis is a technique used in advertisement placement, cognitive databases and search engines. With increasing amounts of data and stringent response-time requirements, improving the underlying implementation of semantic analysis becomes critical. To this end, we look at Supervised Semantic Indexing (SSI), a recently proposed algorithm for semantic analysis. SSI ranks a large number of documents based on their semantic similarity to a text query. For each query, it computes millions of dot products on unstructured data, generates a large intermediate result, and then performs ranking. SSI underperforms on both state-of-the-art multi-cores as well as GPUs. Its performance scalability on multi-cores is hampered by their limited support for fine-grained data parallelism. GPUs, though they beat multi-cores by running thousands of threads, cannot handle large intermediate data because of their small on-chip memory. Motivated by this, we present an FPGA-based hardware accelerator for semantic analysis. As a key feature, the accelerator combines hundreds of simple processing elements together with in-memory processing to simultaneously generate and process (consume) the large intermediate data. It also supports “dynamic parallelism” - a feature that configures the PEs differently for full utilization of the available processing logic after the FPGA is programmed. Our FPGA prototype is 10–13x faster than a 2.5 GHz quad-core Xeon, and 1.5–5x faster than a 240 core 1.3 GHz Tesla GPU, despite operating at a modest frequency of 125 MHz.
Citations: 2
3D recursive Gaussian IIR on GPU and FPGAs — A case for accelerating bandwidth-bounded applications
2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941081
J. Cong, Muhuan Huang, Yi Zou
Abstract: GPU device typically has a higher off-chip bandwidth than FPGA-based systems. Thus typically GPU should perform better for bandwidth-bounded massive parallel applications. In this paper, we present our implementations of a 3D recursive Gaussian IIR on multi-core CPU, many-core GPU and multi-FPGA platforms. Our baseline implementation on the CPU features the smallest arithmetic computation (2 MADDs per dimension). While this application is clearly bandwidth bounded, the difference on the memory subsystems translates to different bandwidth optimization techniques. Our implementations on the GPU and FPGA platforms show 26X and 33X speedup respectively over optimized single-thread code on CPU.
Citations: 11
Scalable object detection accelerators on FPGAs using custom design space exploration
2011 IEEE 9th Symposium on Application Specific Processors (SASP) Pub Date : 2011-06-05 DOI: 10.1109/SASP.2011.5941089
Chen-Chun Huang, F. Vahid
Abstract: We discuss FPGA implementations of object (such as face) detectors in video streams using the accurate Haar-feature based algorithm. Rather than creating one implementation for one FPGA, we develop a method to generate a series of implementations that have different size and performance to target different FPGA devices. The automatic generation was enabled by custom design space exploration on a particular design problem relating to the communication architecture used to support different numbers of image classifiers. The exploration algorithm uses content information in each feature set to optimize and generate a scalable communication architecture. We generated fully-working implementations for Xilinx Virtex5 LX50T, LX110T, and LX155T FPGA devices, using various amounts of available device capacity, leading to speedups ranging from 0.6x to 25x compared to a 3.0 GHz Pentium 4 desktop machine. Automated generators that include custom design space exploration may become more necessary when creating hardware accelerators intended for use across a wide range of existing and future FPGA devices.
Citations: 12