{"title":"A novel parallel Tier-1 coder for JPEG2000 using GPUs","authors":"Roto Le, R. I. Bahar, J. Mundy","doi":"10.1109/SASP.2011.5941091","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941091","url":null,"abstract":"The JPEG2000 image compression standard provides superior features to the popular JPEG standard; however, the slow performance of software implementation of JPEG2000 has kept it from being widely adopted. More than 80% of the execution time for JPEG2000 is spent on the Tier-1 coding engine. While much effort over the past decade has been devoted to optimizing this component, its performance still remains slow. The major reason for this is that the Tier-1 coder consists of highly serial operations, each operating on individual bits in every single bit plane of the image samples. In addition, in the past there lacked an efficient hardware platform to provide massively parallel acceleration for Tier-1. However, the recent growth of general purpose graphic processing unit (GPGPU) provides a great opportunity to solve the problem with thousands of parallel processing threads. In this paper, the computation steps in JPEG2000 are examined, particularly in the Tier-1, and novel, GPGPU compatible, parallel processing methods for the sample-level coding of the images are developed. The GPGPU-based parallel engine allows for significant speedup in execution time compared to the JasPer JPEG2000 compression software. Running on a single Nvidia GTX 480 GPU, the parallel wavelet engine achieves 100× speedup, the parallel bit plane coder achieves more than 30× speedup, and the overall Tier-1 coder achieves up to 17× speedup.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"81 26","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120823991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A hardware acceleration technique for gradient descent and conjugate gradient","authors":"David Kesler, Biplab Deka, Rakesh Kumar","doi":"10.1109/SASP.2011.5941086","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941086","url":null,"abstract":"Application Robustification, a promising approach for reducing processor power, converts applications into numerical optimization problems and solves them using gradient descent and conjugate gradient algorithms [1]. The improvement in robustness, however, comes at the expense of performance when compared to the baseline non-iterative versions of these applications. To mitigate the performance loss from robustification, we present the design of a hardware accelerator and corresponding software support that accelerate gradient descent and conjugate gradient based iterative implementation of applications. Unlike traditional accelerators, our design accelerates different types of linear algebra operations found in many algorithms and is capable of efficiently handling sparse matrices that arise in applications such as graph matching. We show that the proposed accelerator can provide significant speedups for iterative versions of several applications and that for some applications such as least squares, it can substantially improve the computation time as compared to the baseline non-iterative implementation.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"205 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116397595","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How sensitive is processor customization to the workload's input datasets?","authors":"Maximilien Breughe, Zheng Li, Yang Chen, Stijn Eyerman, O. Temam, Chengyong Wu, L. Eeckhout","doi":"10.1109/SASP.2011.5941070","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941070","url":null,"abstract":"Hardware customization is an effective approach for meeting application performance requirements while achieving high levels of energy efficiency. Application-specific processors achieve high performance at low energy by tailoring their designs towards a specific workload, i.e., an application or application domain of interest. A fundamental question that has remained unanswered so far though is to what extent processor customization is sensitive to the training workload's input datasets. Current practice is to consider a single or only a few input datasets per workload during the processor design cycle — the reason being that simulation is prohibitively time-consuming which excludes considering a large number of datasets. This paper addresses this fundamental question, for the first time. In order to perform the large number of runs required to address this question in a reasonable amount of time, we first propose a mechanistic analytical model, built from first principles, that is accurate within 3.6% on average across a broad design space. The analytical model is at least 4 orders of magnitude faster than detailed cycle-accurate simulation for design space exploration. Using the model, we are able to study the sensitivity of a workload's input dataset on the optimum customized processor architecture. Considering MiBench benchmarks and 1000 datasets per benchmark, we conclude that processor customization is largely dataset-insensitive. This has an important implication in practice: a single or only a few datasets are sufficient for determining the optimum processor architecture when designing application-specific processors.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"7 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130921092","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modular high-throughput and low-latency sorting units for FPGAs in the Large Hadron Collider","authors":"Amin Farmahini Farahani, A. Gregerson, M. Schulte, Katherine Compton","doi":"10.1109/SASP.2011.5941075","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941075","url":null,"abstract":"This paper presents efficient techniques for designing high-throughput, low-latency sorting units for FPGA implementation. Our sorting units use modular design techniques that hierarchically construct large sorting units from smaller building blocks. They are optimized for situations in which only the M largest numbers from N inputs are needed; this situation commonly occurs in high-energy physics experiments and other forms of digital signal processing. Based on these techniques, we design parameterized, pipelined sorting units. A detailed analysis indicates that their resource requirements scale linearly with the number of inputs, latencies scale logarithmically with the number of inputs, and frequencies remain fairly constant. Synthesis results indicate that a single pipelined 256-to-4 sorting unit with 19 stages can perform 200 million sorts per second with a latency of about 95 ns per sort on a Virtex-5 FPGA.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"176 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133542815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TARCAD: A template architecture for reconfigurable accelerator designs","authors":"M. Shafiq, M. Pericàs, N. Navarro, E. Ayguadé","doi":"10.1109/SASP.2011.5941071","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941071","url":null,"abstract":"In the race towards computational efficiency, accelerators are achieving prominence. Among the different types, accelerators built using reconfigurable fabric, such as FPGAs, have a tremendous potential due to the ability to customize the hardware to the application. However, the lack of a standard design methodology hinders the adoption of such devices and makes the portability and reusability across designs difficult. In addition, generation of highly customized circuits does not integrate nicely with high level synthesis tools. In this work, we introduce TARCAD, a template architecture to design reconfigurable accelerators. TARCAD enables high customization in the data management and compute engines while retaining a programming model based on generic programming principles. The template provides generality and scalable performance over a range of FPGAs. We describe the template architecture in detail and show how to implement five important scientific kernels: MxM, Acoustic Wave Equation, FFT, SpMV and Smith Waterman. TARCAD is compared with other High Level Synthesis models and is evaluated against GPUs, a well-known architecture that is far less customizable and, therefore, also easier to target from a simple and portable programming model. We analyze the TARCAD template and compare its efficiency on a large Xilinx Virtex-6 device to that of several recent GPU studies.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"95 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115121691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware/software co-designed accelerator for vector graphics applications","authors":"Shuo-Hung Chen, Hsiao-Mei Lin, H. Wei, Yi-Cheng Chen, Chih-Tsun Huang, Yeh-Ching Chung","doi":"10.1109/SASP.2011.5941088","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941088","url":null,"abstract":"This paper proposes a new hardware accelerator to speed up the performance of vector graphics applications on complex embedded systems. The resulting hardware accelerator is synthesized on a field-programmable gate array (FPGA) and integrated with software components. The paper also introduces a hardware/software co-verification environment which provides in-system at-speed functional verification and performance evaluation to verify the hardware/software integrated architecture. The experimental results demonstrate that the integrated hardware accelerator is fifty times faster than a compiler-optimized software component and it enables vector graphics applications to run nearly two times faster.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121308005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamically reconfigurable architecture for a driver assistant system","authors":"N. Harb, S. Niar, M. Saghir, Y. Elhillali, R. B. Atitallah","doi":"10.1109/SASP.2011.5941079","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941079","url":null,"abstract":"Application-specific programmable processors are increasingly being replaced by FPGAs, which offer high levels of logic density, rich sets of embedded hardware blocks, and a high degree of customizability and reconfigurability. New FPGA features such as Dynamic Partial Reconfiguration (DPR) can be leveraged to reduce resource utilization and power consumption while still providing high levels of performance. In this paper, we describe our implementation of a dynamically reconfigurable multiple-target tracking (MTT) module for an automotive driver assistance system. Our module implements a dynamically reconfigurable filtering block that changes with changing driving conditions.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132043586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"System integration of Elliptic Curve Cryptography on an OMAP platform","authors":"Sergey Morozov, Christian Tergino, P. Schaumont","doi":"10.1109/SASP.2011.5941077","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941077","url":null,"abstract":"Elliptic Curve Cryptography (ECC) is popular for digital signatures and other public-key crypto-applications in embedded contexts. However, ECC is computationally intensive, and in particular the performance of the underlying modular arithmetic remains a concern. We investigate the design space of ECC on TI's OMAP 3530 platform, with a focus on using OMAP's DSP core to accelerate ECC computations for the ARM Cortex A8 core. We examine the opportunities of the heterogeneous platform for efficient ECC, including the efficient implementation of the underlying field multiplication on the DSP, and the design partitioning to minimize the communications overhead between ARM and DSP. By migrating the computations to the DSP, we demonstrate a significant speedup for the underlying modular arithmetic with up to 9.24x reduction in execution time, compared to the implementation executing on the ARM Cortex processor. Prototype measurements show an energy reduction of up to 5.3 times. We conclude that a heterogeneous platform offers substantial improvements in performance and energy, but we also point out that the cost of inter-processor communication cannot be ignored.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116021080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ARTE: An Application-specific Run-Time management framework for multi-core systems","authors":"Giovanni Mariani, G. Palermo, C. Silvano, V. Zaccaria","doi":"10.1109/SASP.2011.5941085","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941085","url":null,"abstract":"Programmable multi-core and many-core platforms increase exponentially the challenge of task mapping and scheduling, provided that enough task-parallelism does exist for each application. This problem worsens when dealing with small ecosystems such as embedded systems-on-chip. In fact, in this case, the assumption of exploiting a traditional operating system is out of context given the memory available to satisfy the run-time footprint of such a configuration.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"53 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123939248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A fast CUDA implementation of agrep algorithm for approximate nucleotide sequence matching","authors":"Hongjian Li, Bing Ni, M. Wong, K. Leung","doi":"10.1109/SASP.2011.5941082","DOIUrl":"https://doi.org/10.1109/SASP.2011.5941082","url":null,"abstract":"The availability of huge amounts of nucleotide sequences catalyzes the development of fast algorithms for approximate DNA and RNA string matching. However, most existing online algorithms can only handle small scale problems. When querying large genomes, their performance becomes unacceptable. Offline algorithms such as Bowtie and BWA require building indexes, and their memory requirement is high. We have developed a fast CUDA implementation of agrep algorithm for approximate nucleotide sequence matching by exploiting the huge computational power of modern GPU hardware. Our CUDA program is capable of searching large genomes for patterns of length up to 64 with edit distance up to 9. For example, it is able to search the entire human genome (3.10 Gbp in 24 chromosomes) for patterns of lengths of 30 and 60 with edit distances of 3 and 6 within 371 and 1,188 milliseconds respectively on one NVIDIA GeForce GTX285 graphics card, achieving 70-fold and 36-fold speedups over multithreaded QuadCore CPU counterpart. Our program employs online approach and does not require building indexes of any kind, it thus can be applied in real time. Using two-bits-for-one-character binary representation, its memory requirement is merely one fourth of the original genome size. Therefore it is possible to load multiple genomes simultaneously. The x86 and x64 executables for Linux and Windows, C++ source code, documentations, user manual, and an AJAX MVC website for online real time searching are available at http://agrep.cse.cuhk.edu.hk. Users can also send emails to CUDAagrepGmail.com to queue up for a job.","PeriodicalId":375788,"journal":{"name":"2011 IEEE 9th Symposium on Application Specific Processors (SASP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122911398","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}