{"title":"A novel parallel Tier-1 coder for JPEG2000 using GPUs","authors":"Roto Le, R. I. Bahar, J. Mundy","doi":"10.1109/SASP.2011.5941091","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941091","url":null,"abstract":"The JPEG2000 image compression standard provides superior features to the popular JPEG standard; however, the slow performance of software implementation of JPEG2000 has kept it from being widely adopted. More than 80% of the execution time for JPEG2000 is spent on the Tier-1 coding engine. While much effort over the past decade has been devoted to optimizing this component, its performance still remains slow. The major reason for this is that the Tier-1 coder consists of highly serial operations, each operating on individual bits in every single bit plane of the image samples. In addition, in the past there lacked an efficient hardware platform to provide massively parallel acceleration for Tier-1. However, the recent growth of general purpose graphic processing unit (GPGPU) provides a great opportunity to solve the problem with thousands of parallel processing threads. In this paper, the computation steps in JPEG2000 are examined, particularly in the Tier-1, and novel, GPGPU compatible, parallel processing methods for the sample-level coding of the images are developed. The GPGPU-based parallel engine allows for significant speedup in execution time compared to the JasPer JPEG2000 compression software. Running on a single Nvidia GTX 480 GPU, the parallel wavelet engine achieves 100× speedup, the parallel bit plane coder achieves more than 30× speedup, and the overall Tier-1 coder achieves up to 17× speedup.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"81 26","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120823991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hardware acceleration technique for gradient descent and conjugate gradient","authors":"David Kesler, Biplab Deka, Rakesh Kumar","doi":"10.1109/SASP.2011.5941086","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941086","url":null,"abstract":"Application Robustification, a promising approach for reducing processor power, converts applications into numerical optimization problems and solves them using gradient descent and conjugate gradient algorithms [1]. The improvement in robustness, however, comes at the expense of performance when compared to the baseline non-iterative versions of these applications. To mitigate the performance loss from robustification, we present the design of a hardware accelerator and corresponding software support that accelerate gradient descent and conjugate gradient based iterative implementation of applications. Unlike traditional accelerators, our design accelerates different types of linear algebra operations found in many algorithms and is capable of efficiently handling sparse matrices that arise in applications such as graph matching. We show that the proposed accelerator can provide significant speedups for iterative versions of several applications and that for some applications such as least squares, it can substantially improve the computation time as compared to the baseline non-iterative implementation.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"205 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116397595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How sensitive is processor customization to the workload's input datasets?","authors":"Maximilien Breughe, Zheng Li, Yang Chen, Stijn Eyerman, O. Temam, Chengyong Wu, L. Eeckhout","doi":"10.1109/SASP.2011.5941070","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941070","url":null,"abstract":"Hardware customization is an effective approach for meeting application performance requirements while achieving high levels of energy efficiency. Application-specific processors achieve high performance at low energy by tailoring their designs towards a specific workload, i.e., an application or application domain of interest. A fundamental question that has remained unanswered so far though is to what extent processor customization is sensitive to the training workload's input datasets. Current practice is to consider a single or only a few input datasets per workload during the processor design cycle — the reason being that simulation is prohibitively time-consuming which excludes considering a large number of datasets. This paper addresses this fundamental question, for the first time. In order to perform the large number of runs required to address this question in a reasonable amount of time, we first propose a mechanistic analytical model, built from first principles, that is accurate within 3.6% on average across a broad design space. The analytical model is at least 4 orders of magnitude faster than detailed cycle-accurate simulation for design space exploration. Using the model, we are able to study the sensitivity of a workload's input dataset on the optimum customized processor architecture. Considering MiBench benchmarks and 1000 datasets per benchmark, we conclude that processor customization is largely dataset-insensitive. This has an important implication in practice: a single or only a few datasets are sufficient for determining the optimum processor architecture when designing application-specific processors.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"7 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130921092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modular high-throughput and low-latency sorting units for FPGAs in the Large Hadron Collider","authors":"Amin Farmahini Farahani, A. Gregerson, M. Schulte, Katherine Compton","doi":"10.1109/SASP.2011.5941075","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941075","url":null,"abstract":"This paper presents efficient techniques for designing high-throughput, low-latency sorting units for FPGA implementation. Our sorting units use modular design techniques that hierarchically construct large sorting units from smaller building blocks. They are optimized for situations in which only the M largest numbers from N inputs are needed; this situation commonly occurs in high-energy physics experiments and other forms of digital signal processing. Based on these techniques, we design parameterized, pipelined sorting units. A detailed analysis indicates that their resource requirements scale linearly with the number of inputs, latencies scale logarithmically with the number of inputs, and frequencies remain fairly constant. Synthesis results indicate that a single pipelined 256-to-4 sorting unit with 19 stages can perform 200 million sorts per second with a latency of about 95 ns per sort on a Virtex-5 FPGA.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133542815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TARCAD: A template architecture for reconfigurable accelerator designs","authors":"M. Shafiq, M. Pericàs, N. Navarro, E. Ayguadé","doi":"10.1109/SASP.2011.5941071","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941071","url":null,"abstract":"In the race towards computational efficiency, accelerators are achieving prominence. Among the different types, accelerators built using reconfigurable fabric, such as FPGAs, have a tremendous potential due to the ability to customize the hardware to the application. However, the lack of a standard design methodology hinders the adoption of such devices and makes the portability and reusability across designs difficult. In addition, generation of highly customized circuits does not integrate nicely with high level synthesis tools. In this work, we introduce TARCAD, a template architecture to design reconfigurable accelerators. TARCAD enables high customization in the data management and compute engines while retaining a programming model based on generic programming principles. The template provides generality and scalable performance over a range of FPGAs. We describe the template architecture in detail and show how to implement five important scientific kernels: MxM, Acoustic Wave Equation, FFT, SpMV and Smith Waterman. TARCAD is compared with other High Level Synthesis models and is evaluated against GPUs, a well-known architecture that is far less customizable and, therefore, also easier to target from a simple and portable programming model. We analyze the TARCAD template and compare its efficiency on a large Xilinx Virtex-6 device to that of several recent GPU studies.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115121691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware/software co-designed accelerator for vector graphics applications","authors":"Shuo-Hung Chen, Hsiao-Mei Lin, H. Wei, Yi-Cheng Chen, Chih-Tsun Huang, Yeh-Ching Chung","doi":"10.1109/SASP.2011.5941088","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941088","url":null,"abstract":"This paper proposes a new hardware accelerator to speed up the performance of vector graphics applications on complex embedded systems. The resulting hardware accelerator is synthesized on a field-programmable gate array (FPGA) and integrated with software components. The paper also introduces a hardware/software co-verification environment which provides in-system at-speed functional verification and performance evaluation to verify the hardware/software integrated architecture. The experimental results demonstrate that the integrated hardware accelerator is fifty times faster than a compiler-optimized software component and it enables vector graphics applications to run nearly two times faster.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121308005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamically reconfigurable architecture for a driver assistant system","authors":"N. Harb, S. Niar, M. Saghir, Y. Elhillali, R. B. Atitallah","doi":"10.1109/SASP.2011.5941079","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941079","url":null,"abstract":"Application-specific programmable processors are increasingly being replaced by FPGAs, which offer high levels of logic density, rich sets of embedded hardware blocks, and a high degree of customizability and reconfigurability. New FPGA features such as Dynamic Partial Reconfiguration (DPR) can be leveraged to reduce resource utilization and power consumption while still providing high levels of performance. In this paper, we describe our implementation of a dynamically reconfigurable multiple-target tracking (MTT) module for an automotive driver assistance system. Our module implements a dynamically reconfigurable filtering block that changes with changing driving conditions.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132043586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"System integration of Elliptic Curve Cryptography on an OMAP platform","authors":"Sergey Morozov, Christian Tergino, P. Schaumont","doi":"10.1109/SASP.2011.5941077","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941077","url":null,"abstract":"Elliptic Curve Cryptography (ECC) is popular for digital signatures and other public-key crypto-applications in embedded contexts. However, ECC is computationally intensive, and in particular the performance of the underlying modular arithmetic remains a concern. We investigate the design space of ECC on TI's OMAP 3530 platform, with a focus on using OMAP's DSP core to accelerate ECC computations for the ARM Cortex A8 core. We examine the opportunities of the heterogeneous platform for efficient ECC, including the efficient implementation of the underlying field multiplication on the DSP, and the design partitioning to minimize the communications overhead between ARM and DSP. By migrating the computations to the DSP, we demonstrate a significant speedup for the underlying modular arithmetic with up to 9.24x reduction in execution time, compared to the implementation executing on the ARM Cortex processor. Prototype measurements show an energy reduction of up to 5.3 times. We conclude that a heterogeneous platform offers substantial improvements in performance and energy, but we also point out that the cost of inter-processor communication cannot be ignored.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116021080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ARTE: An Application-specific Run-Time management framework for multi-core systems","authors":"Giovanni Mariani, G. Palermo, C. Silvano, V. Zaccaria","doi":"10.1109/SASP.2011.5941085","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941085","url":null,"abstract":"Programmable multi-core and many-core platforms increase exponentially the challenge of task mapping and scheduling, provided that enough task-parallelism does exist for each application. This problem worsens when dealing with small ecosystems such as embedded systems-on-chip. In fact, in this case, the assumption of exploiting a traditional operating system is out of context given the memory available to satisfy the run-time footprint of such a configuration.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123939248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fast CUDA implementation of agrep algorithm for approximate nucleotide sequence matching","authors":"Hongjian Li, Bing Ni, M. Wong, K. Leung","doi":"10.1109/SASP.2011.5941082","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941082","url":null,"abstract":"The availability of huge amounts of nucleotide sequences catalyzes the development of fast algorithms for approximate DNA and RNA string matching. However, most existing online algorithms can only handle small scale problems. When querying large genomes, their performance becomes unacceptable. Offline algorithms such as Bowtie and BWA require building indexes, and their memory requirement is high. We have developed a fast CUDA implementation of agrep algorithm for approximate nucleotide sequence matching by exploiting the huge computational power of modern GPU hardware. Our CUDA program is capable of searching large genomes for patterns of length up to 64 with edit distance up to 9. For example, it is able to search the entire human genome (3.10 Gbp in 24 chromosomes) for patterns of lengths of 30 and 60 with edit distances of 3 and 6 within 371 and 1,188 milliseconds respectively on one NVIDIA GeForce GTX285 graphics card, achieving 70-fold and 36-fold speedups over multithreaded QuadCore CPU counterpart. Our program employs online approach and does not require building indexes of any kind, it thus can be applied in real time. Using two-bits-for-one-character binary representation, its memory requirement is merely one fourth of the original genome size. Therefore it is possible to load multiple genomes simultaneously. The x86 and x64 executables for Linux and Windows, C++ source code, documentations, user manual, and an AJAX MVC website for online real time searching are available at http://agrep.cse.cuhk.edu.hk. Users can also send emails to CUDAagrepGmail.com to queue up for a job.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122911398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}