{"title":"OpenMP device offloading to FPGA accelerators","authors":"Lukas Sommer, Jens Korinth, A. Koch","doi":"10.1109/ASAP.2017.7995280","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995280","url":null,"abstract":"Future high-performance computing systems will need to include multiple specialized accelerators in a single heterogeneous system to overcome power-density limitations of CPU performance.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131987101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DeepPump: Multi-pumping deep Neural Networks","authors":"Ruizhe Zhao, T. Todman, W. Luk, Xinyu Niu","doi":"10.1109/ASAP.2017.7995281","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995281","url":null,"abstract":"This paper presents DeepPump, an approach that generates CNN hardware designs with multi-pumping, which have competitive performance when compared with previous designs. Future work includes integrating DeepPump with other optimisations, and providing further evaluations on various FPGA platforms.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"296 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132147797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CFStore: Boosting Hybrid storage performance by device crossfire","authors":"Wei Zhou, D. Feng, Zhipeng Tan","doi":"10.1109/ASAP.2017.7995265","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995265","url":null,"abstract":"Hybrid storage is widely deployed because it satisfies capacity and performance requirements in an economically viable fashion. With rapid technical improvement, hybrid storage systems consisting of several types of SSDs will gradually be adopted. Existing work mostly concentrates on fully utilizing the high-performance device but neglects the capability of the low-performance device. This paper proposes a device crossfire method to boost hybrid storage performance by efficiently leveraging both high-performance and low-performance devices. Performance-critical data are appropriately off-loaded to the low-performance device to exploit access parallelism. The implemented storage system, CFStore, exhibits good performance in experiments. Compared to the well-known hybrid storage system Hystor, CFStore improves throughput by 17.9%–42.6% and reduces latency by 15.9%–35.0%.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"224 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115739365","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware-accelerated CCD readout smear correction for Fast Solar Polarimeter","authors":"Stefan Tabel, Korbinian Weikl, W. Stechele","doi":"10.1109/ASAP.2017.7995261","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995261","url":null,"abstract":"Shutterless frame store charge-coupled devices (CCDs) are commonly used in ground-based solar observations, but the characteristic readout smear error of such devices hinders the application of frame store CCDs to autonomous missions. The combination of polarimetric modulation and image accumulation prevents correction of this error via software-based post-facto processing if, in addition, microvibrations occur during flight. This paper presents the first FPGA-based architecture for online smear correction of images from frame store CCDs, which allows a certain frame store CCD camera to be used on a balloon-borne solar observatory. First, we explore fast convolution-based algorithms with respect to their properties for an implementation. Afterwards, a hardware architecture is derived and implemented. Our results show that 400 frames of megapixel size can be corrected per second, maintaining an acceptable power consumption of less than 12 W. Finally, we discuss the circuit and show the degrees of freedom for further designs.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122232678","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling and evaluation for gather/scatter operations in Vector-SIMD architectures","authors":"Hongbing Tan, Haiyan Chen, Sheng Liu, Jianguo Wu","doi":"10.1109/ASAP.2017.7995271","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995271","url":null,"abstract":"Gather and scatter are state-of-the-art vector memory access modes in Vector-SIMD architectures. However, because of their stochastic and complicated properties, the hardware design of gather/scatter operations lacks theoretical analysis and modeling. This paper proposes, for the first time, a model for gather/scatter operations on local vector memory. The model can give all the possible distributions of access locations, calculate the probability of access conflicts, predict the number of access conflicts, and provide theoretical guidance for performance optimization. The model is validated through experiments and can guide users in designing and optimizing the implementation of gather/scatter operations more specifically.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126581629","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Real-time object detection in software with custom vector instructions and algorithm changes","authors":"Joe Edwards, G. Lemieux","doi":"10.1109/ASAP.2017.7995262","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995262","url":null,"abstract":"Real-time vision applications place stringent performance requirements on embedded systems. To meet performance requirements, embedded systems often require hardware implementations. This approach is unfavorable as hardware development can be difficult to debug, time-consuming, and require extensive skill. This paper presents a case study of accelerating face detection, often part of a complex image processing pipeline, using a software/hardware hybrid approach. As a baseline, the algorithm is initially run on a scalar ARM Cortex-A9 application processor found on a Xilinx Zynq device. Next, using a previously designed vector engine implemented in the FPGA fabric, the algorithm is vectorized, using only standard vector instructions, to achieve a 25× speedup. Then, we accelerate the critical inner loops by adding two hardware-assisted custom vector instructions for an additional 10× speedup, yielding 248× speedup over the initial Cortex-A9 baseline. Collectively, the custom instructions require fewer than 800 lines of VHDL code, including comments and blank lines. Compared to previous hardware-only face detection systems, our work is 1.5 to 6.8 times faster. This approach demonstrates that good performance can be obtained from software-only vectorization, and a small amount of custom hardware can provide a significant acceleration boost.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126292217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An embedded scalable linear model predictive hardware-based controller using ADMM","authors":"Pei Zhang, Joseph Zambreno, Phillip H. Jones","doi":"10.1109/ASAP.2017.7995276","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995276","url":null,"abstract":"Model predictive control (MPC) is a popular advanced model-based control algorithm for controlling systems that must respect a set of system constraints (e.g. actuator force limitations). However, the computing requirements of MPC limit the suitability of deploying its software implementation into embedded controllers requiring high update rates. This paper presents a scalable embedded MPC controller implemented on a field-programmable gate array (FPGA) coupled with an on-chip ARM processor. Our architecture implements an Alternating Direction Method of Multipliers (ADMM) approach for computing MPC controller commands. All computations are performed using floating-point arithmetic. We introduce a software/hardware (SW/HW) co-design methodology, for which the ARM software can configure on-chip Block RAM to allow users to 1) configure the MPC controller for a wide range of plants, and 2) update at runtime the desired trajectory to track. Our hardware architecture has the flexibility to compromise between the amount of hardware resources used (regarding Block RAMs and DSPs) and the controller computing speed. For example, this flexibility gives the ability to control plants modeled by a large number of decision variables (i.e. a plant model using many Block RAMs) with a small number of computing resources (i.e. DSPs) at the cost of increased computing time. The hardware controller is verified using a Plant-on-Chip (PoC), which is configured to emulate a mass-spring system in real-time. A major driving goal of this work is to architect an SW/HW platform that brings FPGAs a step closer to being widely adopted by advanced control algorithm designers for deploying their algorithms into embedded systems.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116885172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Staged Memory Resource Management Method for CMP systems","authors":"Yangguo Liu, Junlin Lu, Dong Tong, Xu Cheng","doi":"10.1109/ASAP.2017.7995264","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995264","url":null,"abstract":"Memory interference is a critical impediment to system performance in CMP systems. To address this problem, we first propose a Dynamically Proportional Bandwidth Throttling policy (DPBT), which dynamically throttles back memory-intensive applications based on their memory access behavior. DPBT achieves a more balanced memory bandwidth partitioning. Moreover, we improve the previous memory channel partitioning scheme by integrating it with bank partitioning. We further integrate DPBT with the improved memory channel partitioning scheme and a memory scheduling policy to leverage the architectural advantages, and present a Staged Memory Resource Management Method (SRM). Experimental results show that DPBT improves system throughput/fairness by 13.5%/31.1%. SRM provides 27.1% better system throughput and 34.8% better system fairness.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125600532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-throughput area-efficient processor for 3GPP LTE cryptographic core algorithms","authors":"Yuanhong Huo, Dake Liu","doi":"10.1109/ASAP.2017.7995285","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995285","url":null,"abstract":"Three sets of cryptographic algorithms are used in LTE, and each set is based on one core algorithm. In high-end embedded systems, it is necessary to implement the three core algorithms (the block cipher AES-128 and the stream ciphers SNOW 3G and ZUC) with high performance and low silicon cost. This paper proposes a high-throughput ASIP (application-specific instruction-set processor) design (CP-LTE) for the three core algorithms.","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115325289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RVNet: A fast and high energy efficiency network packet processing system on RISC-V","authors":"Yanpeng Wang, M. Wen, Chunyuan Zhang, Jie Lin","doi":"10.1109/ASAP.2017.7995266","DOIUrl":"https://doi.org/10.1109/ASAP.2017.7995266","url":null,"abstract":"RISC-V is a new open-source general-purpose instruction set architecture (ISA) developed at the University of California, Berkeley. It allows anyone to design hardware circuits tailored to application characteristics and can be used in embedded devices, desktop computers, and high-performance servers. In this paper, we use a RISC-V processor to design a fast network packet processing system. It aims to provide faster network data processing for upper-layer applications in SDN and NFV at lower power and lower cost. According to results from our prototype on a Field Programmable Gate Array (FPGA), our system has performance comparable to DPDK, one of the fastest packet processing frameworks on the x86 platform. Notably, our system achieves higher network packet processing energy efficiency than DPDK (about 7.75 times).","PeriodicalId":405953,"journal":{"name":"2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134560253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}