2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI): Latest Articles, Page 4

Area Efficient Box Filter Acceleration by Parallelizing with Optimized Adder Tree

2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00019

Xinzhe Liu, Fupeng Chen, Y. Ha

Abstract: Box filters are widely used in image and video processing applications. To achieve real-time performance for these applications, designers may need to parallelize these box filters. However, it is very challenging to implement a parallel box filter on a modern programmable system-on-chip (SoC). On one hand, the dependency between the operations of a box filter is too strong to achieve parallelism. On the other hand, more adder trees are required as the degree of parallelism increases. In this paper, we propose a performance- and area-efficient box filter. It uses the partial sum difference, which needs far fewer resources, to effectively calculate the box filter. We make full use of this reusable partial sum to optimize the adder trees for parallel processing. We also present two case studies of the box filter by applying it to the guided filter and the stereo matching algorithm on a programmable SoC using a C-based design flow. Our method removes the dependencies between the parallel operations of the box filter. Compared to the state of the art, results show that the computational complexity of the adder tree for a single pixel is reduced from O(R^2) to O((R+N)lgN/N) on average. There are orders-of-magnitude reductions in resource usage with large filter size R and parallelization degree N. The throughput can be increased by N times, where N is up to 72 in the case of the Xilinx FPGA board XCZU9EG.

Pages: 55-60

Citations: 1
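
The box filter abstract above relies on reusing a partial sum and updating it by a difference rather than re-adding the whole window. The paper's parallel adder-tree architecture is not described in this listing, so the following is only a minimal 1D software sketch of the generic running-sum idea, with zero padding at the borders assumed; the function names `box_sum_naive` and `box_sum_running` are illustrative and do not come from the paper.

```python
def box_sum_naive(x, r):
    """Sum of x over the window [i - r, i + r] for each i (zeros outside)."""
    n = len(x)
    return [sum(x[j] for j in range(max(0, i - r), min(n, i + r + 1)))
            for i in range(n)]


def box_sum_running(x, r):
    """Same result, but each output reuses the previous window sum.

    Moving the window from i - 1 to i adds the sample entering on the
    right (index i + r) and subtracts the one leaving on the left
    (index i - r - 1), so the per-output cost does not depend on r.
    """
    n = len(x)
    if n == 0:
        return []
    s = sum(x[:min(n, r + 1)])  # window sum for i = 0
    out = [s]
    for i in range(1, n):
        enter, leave = i + r, i - r - 1
        if enter < n:
            s += x[enter]
        if leave >= 0:
            s -= x[leave]
        out.append(s)
    return out


if __name__ == "__main__":
    samples = [1, 2, 3, 4, 5]
    print(box_sum_running(samples, 1))
```

Each output here depends on the previous one, which is the kind of serial dependency the abstract says must be removed before the filter can be parallelized.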

Automated Communication and Floorplan-Aware Hardware/Software Co-Design for SoC

2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00032

Jong Bin Lim, Deming Chen

Abstract: The main objective of modern SoC (System-on-Chip) designs is to achieve high performance while maintaining low power consumption and resource usage. However, achieving such a goal is a difficult and time-consuming engineering task due to the vast design space of hardware accelerators and HW/SW task partitioning. Depending on the partitioning decision, communication between parts of the SoC must also be optimized so that the overall runtime, including both computation and communication, is fast. In this paper, we propose an automated approach to iteratively search for a near-optimal SoC design with minimum latency within the targeted power and resource budget. Our approach consists of the following main components: (1) polyhedral-model-based hardware accelerator design space exploration, (2) modeling of various communication types and integration into LLVM-based integer linear programming for HW/SW task partitioning, (3) a fast and efficient search algorithm to extract the maximum operating frequency using a floorplanner, and (4) back-annotation of extracted information to the system level for iterative partitioning. Using an FPGA as the target platform, we demonstrate that our approach consistently outperforms the previous state-of-the-art solutions for automated HW/SW co-design by 37.8% on average and up to 75.2% for certain designs.

Pages: 128-133

Citations: 2

Accelerating Compact Convolutional Neural Networks with Multi-threaded Data Streaming

2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00099

Weiguang Chen, Z. Wang, Shanliao Li, Zhibin Yu, Huijuan Li

Citations: 4

A 1.8mW Perception Chip with Near-Sensor Processing Scheme for Low-Power AIoT Applications

2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00087

Zheyu Liu, Erxiang Ren, Li Luo, Qi Wei, Xing Wu, Xueqing Li, F. Qiao, Xinjun Liu, Huazhong Yang

Citations: 7

Formal Verification of Integer Dividers: Division by a Constant

2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00022

Atif Yasin, Tiankai Su, S. Pillement, M. Ciesielski

Citations: 6

Security in Many-Core SoCs Leveraged by Opaque Secure Zones

2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00091

L. L. Caimi, F. Moraes

Citations: 11

Title Page iii

2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/isvlsi.2019.00002

Citations: 0

Traffic Driven Automated Synthesis of Network-on-Chip from Physically Aware Behavioral Specification

2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00031

Anup Gangwar, Zheng Xu, N. Agarwal, Ravishankar Sreedharan, Ambica Prasad

Abstract: The process of laying out the various interconnect components and configuring them is termed interconnect synthesis. A Network-on-Chip (NoC) has various building blocks such as routers, resizers, power and clock domain converters (PCDCs), pipeline elements, etc. A software tool is needed to connect these various components (topology) and then configure them (including routing) so that the user constraints are met and the overall logic and wiring cost, along with power, is kept low. In this paper, we present a tool which generates Power, Performance and Area (PPA) optimized NoCs. The input is a behavioral specification which consists of a rough floorplan, bridge parameters, available clock, power and voltage domains, address spaces, stochastic traffic (including classes and latency criticality), traffic dependency, and any partial topology for the locked-down portions of the NoC. The output is an optimized NoC, with instantiation and placement of components (routers, resizers, etc.), Virtual Channel (VC) assignments, clock domain assignments, routing, bridge parameter tuning, FIFO sizes, etc. Using this flow, we are able to generate NoCs which are within 15% of the hand-tuned designs (optimized over several months) for various metrics, and which exceed critical metrics by as much as 30%.

Pages: 122-127

Citations: 0

A Low-Complexity RS Decoder for Triple-Error-Correcting RS Codes

2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00094

Zengchao Yan, Jun Lin, Zhongfeng Wang

Abstract: Reed-Solomon (RS) codes have been widely used in digital communication and storage systems. The commonly used decoding algorithms include the Berlekamp-Massey (BM) algorithm and its variants, such as the inversionless BM (iBM) and the reformulated inversionless BM (RiBM). All these algorithms require computation-intensive procedures, including the key equation solver (KES) and the Chien search and Forney algorithm (CS&F). For RS codes with error correction ability t ≤ 2, it is known that error locations and magnitudes can be found through a direct equation solver. However, for RS codes with t = 3, no such work has been reported yet. In this paper, a low-complexity algorithm for triple-error-correcting RS codes is proposed. Moreover, an optimized architecture for the proposed algorithm is developed. For a (255, 239) RS code over GF(2^8), the synthesis results show that the area efficiency of the proposed decoder is 217% higher than that of the conventional RiBM-based RS decoder in 4-parallel. As the degree of parallelism increases, the area efficiency is increased to 364% in the 16-parallel architecture. The synthesis results show that the proposed decoder for the given example RS code can achieve a throughput as large as 124 Gb/s.

Pages: 489-494

Citations: 2

Fast-ABC: A Fast Architecture for Bottleneck-Like Based Convolutional Neural Networks

2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00010

Xiaoru Xie, Fangxuan Sun, Jun Lin, Zhongfeng Wang

Abstract: In recent years, efficient inference of neural networks has become one of the most popular research fields. In order to reduce the required number of computations and weights, many efforts have been made to construct lightweight networks (LWNs), where bottleneck-like operations (BLOs) have been widely adopted. However, most current hardware accelerators are not able to utilize the optimization space for BLOs. This paper first shows, via both theoretical analysis and experimental results, that the conventional computational flows employed by most existing accelerators incur an extremely low resource utilization ratio due to the extremely high DRAM bandwidth requirements in these LWNs. To address this issue, a partial fusion strategy which can drastically reduce the bandwidth requirement is proposed. Additionally, the Winograd algorithm is employed to further reduce the computational complexity. Based on these, an efficient accelerator for BLO-based networks called Fast Architecture for Bottleneck-like based Convolutional neural networks (Fast-ABC) is proposed. Fast-ABC is implemented on Altera Stratix V GSMD8, and can achieve a very high throughput of up to 137 fps and 264 fps on ResNet-18 and MobileNetV2, respectively. Implementation results show that the proposed architecture significantly improves the throughput on LWNs compared with prior art, at a much lower resource cost.

Pages: 1-6

Citations: 4
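
The Fast-ABC abstract above mentions the Winograd algorithm as a way to reduce computational complexity. The listing does not say which tile size the accelerator uses, so the sketch below shows only the textbook 1D case F(2,3), which produces two outputs of a 3-tap filter with four multiplications instead of six. The function names are illustrative assumptions, not identifiers from the paper.

```python
def conv_direct(d, g):
    """Two outputs of a 3-tap filter g over four inputs d (6 multiplies)."""
    y0 = d[0] * g[0] + d[1] * g[1] + d[2] * g[2]
    y1 = d[1] * g[0] + d[2] * g[1] + d[3] * g[2]
    return y0, y1


def conv_winograd_f23(d, g):
    """Same two outputs using the Winograd F(2,3) form (4 multiplies).

    The filter-side terms (g0 + g1 + g2) / 2 and (g0 - g1 + g2) / 2 depend
    only on the weights, so in practice they can be precomputed once.
    """
    u_sum = (g[0] + g[1] + g[2]) / 2
    u_alt = (g[0] - g[1] + g[2]) / 2
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * u_sum
    m3 = (d[2] - d[1]) * u_alt
    m4 = (d[1] - d[3]) * g[2]
    return m1 + m2 + m3, m2 - m3 - m4


if __name__ == "__main__":
    inputs = [1, 2, 3, 4]
    weights = [2, 4, 6]
    print(conv_direct(inputs, weights))
    print(conv_winograd_f23(inputs, weights))
```

The saving in multiplications is paid for with extra additions and the division by 2 on the weights, which is one reason hardware designs typically apply the weight transform offline.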