2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI): Latest Papers

Area Efficient Box Filter Acceleration by Parallelizing with Optimized Adder Tree
2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00019
Xinzhe Liu, Fupeng Chen, Y. Ha
Abstract: Box filters are widely used in image and video processing applications. To achieve real-time performance for these applications, designers may need to parallelize these box filters. However, it is very challenging to implement a parallel box filter on a modern programmable system-on-chip (SoC). On one hand, the dependency between the operations of a box filter is too strong to achieve parallelism. On the other hand, more adder trees are required as the degree of parallelism increases. In this paper, we propose a performance- and area-efficient box filter. It uses the partial sum difference, which needs far fewer resources, to calculate the box filter effectively. We make full use of this reusable partial sum to optimize the adder trees for parallel processing. We also present two case studies of the box filter by applying it to the guided filter and the stereo matching algorithm on a programmable SoC using a C-based design flow. Our method removes the dependencies between the parallel operations of the box filter. Compared to the state of the art, results show that the computational complexity of the adder tree for a single pixel is reduced from O(R^2) to O((R+N) lg N / N) on average. Resource usage is reduced by orders of magnitude for large filter size R and parallelization degree N. The throughput can be increased by N times, where N is up to 72 in the case of the Xilinx FPGA board XCZU9EG.
Pages: 55-60
Citations: 1
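The abstract above relies on reusing a partial sum instead of re-adding every sample in each window. The following is a minimal one-dimensional sketch of that general idea (the classic running-sum box filter), not the paper's parallel adder-tree architecture. The function names and the 1-D setting are assumptions made for illustration.

```python
def box_filter_naive(x, r):
    """Sum every window of length r directly: r - 1 additions per output."""
    return [sum(x[i:i + r]) for i in range(len(x) - r + 1)]


def box_filter_running(x, r):
    """Reuse the previous window sum: add the entering sample, subtract the leaving one."""
    if r <= 0 or r > len(x):
        return []
    window = sum(x[:r])           # first window is summed in full
    out = [window]
    for i in range(r, len(x)):
        window += x[i] - x[i - r]  # difference between consecutive window sums
        out.append(window)
    return out
```

Each new output costs two additions regardless of r, but it depends on the previous output. That serial dependency is the obstacle to parallelism that the abstract says the proposed method removes.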
Automated Communication and Floorplan-Aware Hardware/Software Co-Design for SoC
2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00032
Jong Bin Lim, Deming Chen
Abstract: The main objective of modern SoC (System-on-Chip) designs is to achieve high performance while maintaining low power consumption and resource usage. However, achieving such a goal is a difficult and time-consuming engineering task due to the vast design space of hardware accelerators and HW/SW task partitioning. Depending on the partitioning decision, communication between parts of the SoC must also be optimized so that the overall runtime, including both computation and communication, is short. In this paper, we propose an automated approach to iteratively search for a near-optimal SoC design with minimum latency within the targeted power and resource budget. Our approach consists of the following main components: (1) polyhedral-model-based hardware accelerator design space exploration, (2) modeling of various communication types and integration into LLVM-based integer linear programming for HW/SW task partitioning, (3) a fast and efficient search algorithm to extract the maximum operating frequency using a floorplanner, and (4) back-annotation of extracted information to the system level for iterative partitioning. Using FPGA as the target platform, we demonstrate that our approach consistently outperforms the previous state-of-the-art solutions for automated HW/SW co-design by 37.8% on average and up to 75.2% for certain designs.
Pages: 128-133
Citations: 2
Accelerating Compact Convolutional Neural Networks with Multi-threaded Data Streaming
2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00099
Weiguang Chen, Z. Wang, Shanliao Li, Zhibin Yu, Huijuan Li
Abstract: Recent advances in convolutional neural networks (CNNs) reveal a trend towards designing compact structures such as MobileNet, which adopts variations of traditional computing kernels such as pointwise and depthwise convolution. Such modified operations significantly reduce model size with only a slight degradation in inference accuracy. State-of-the-art neural accelerators have not yet fully exploited algorithmic parallelism for such computing kernels in compact CNNs. In this work, we propose a multithreaded data streaming architecture for fast and highly parallel execution of pointwise and depthwise convolution, which can also be dynamically reconfigured to process conventional convolution, pooling, and fully connected network layers. The architecture achieves efficient memory bandwidth utilization by exploiting two modes of data alignment. We profile MobileNet on the proposed architecture and demonstrate a 9.36x speed-up compared to a single-threaded architecture.
Pages: 519-522
Citations: 4
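The abstract above notes that pointwise and depthwise convolution reduce model size and computation. As a rough illustration of why, the sketch below counts multiply-accumulate operations (MACs) for a dense convolution and for its depthwise-plus-pointwise replacement. The function names, the stride-1 "same"-sized output, and the example layer shape are assumptions, not figures from the paper.

```python
def standard_conv_macs(h, w, c_in, c_out, k):
    """MACs for a dense k x k convolution producing an h x w x c_out output."""
    return h * w * c_in * c_out * k * k


def depthwise_macs(h, w, c_in, k):
    """MACs for a depthwise convolution: one k x k filter per input channel."""
    return h * w * c_in * k * k


def pointwise_macs(h, w, c_in, c_out):
    """MACs for a pointwise (1 x 1) convolution that mixes channels."""
    return h * w * c_in * c_out


def separable_conv_macs(h, w, c_in, c_out, k):
    """MACs for depthwise followed by pointwise convolution."""
    return depthwise_macs(h, w, c_in, k) + pointwise_macs(h, w, c_in, c_out)


if __name__ == "__main__":
    # Hypothetical layer shape chosen only for illustration.
    dense = standard_conv_macs(112, 112, 32, 64, 3)
    separable = separable_conv_macs(112, 112, 32, 64, 3)
    print(dense, separable, separable / dense)
```

The ratio of the two counts is 1/c_out + 1/k^2, so for 3 x 3 kernels the separable form needs roughly one eighth to one ninth of the MACs when c_out is large.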
A 1.8mW Perception Chip with Near-Sensor Processing Scheme for Low-Power AIoT Applications
2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00087
Zheyu Liu, Erxiang Ren, Li Luo, Qi Wei, Xing Wu, Xueqing Li, F. Qiao, Xinjun Liu, Huazhong Yang
Abstract: In the past few years, the demand for intelligence in IoT front-end devices has dramatically increased. However, such devices face challenges of limited on-chip resources and strict power or energy constraints. Recent progress in binarized neural networks (BNNs) has provided promising solutions for front-end processing systems to conduct simple detection and classification tasks by making trade-offs between processing quality and computation complexity. In this paper, we propose a mixed-signal perception chip, in which an ADC-free 32x32 image sensor and a BNN processing array are directly integrated in a 180nm standard CMOS process. Taking advantage of the ADC-free processing architecture, the whole processing system consumes only 1.8mW of power, while providing up to 545.4 GOPS/W energy efficiency. The implementation performance and energy efficiency are comparable with state-of-the-art designs in much more advanced CMOS technologies. This work provides a promising alternative for low-power IoT intelligent applications.
Pages: 447-452
Citations: 7
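The abstract above mentions binarized neural networks as the source of the quality/complexity trade-off. The sketch below shows the usual software view of a binarized dot product, where +1/-1 values are stored as bits and the multiply-accumulate becomes an XOR followed by a bit count. It is a generic illustration only; it does not model the chip's mixed-signal, ADC-free circuit, and the helper names are hypothetical.

```python
def pack_signs(values):
    """Pack a list of +1/-1 values into an integer, one bit per element (+1 -> 1, -1 -> 0)."""
    bits = 0
    for i, v in enumerate(values):
        if v == 1:
            bits |= 1 << i
        elif v != -1:
            raise ValueError("values must be +1 or -1")
    return bits


def binary_dot(a_bits, b_bits, n):
    """Dot product of two packed +1/-1 vectors of length n.

    Matching bits contribute +1 and differing bits contribute -1,
    so the result is n - 2 * (number of differing bits).
    """
    mismatches = bin(a_bits ^ b_bits).count("1")
    return n - 2 * mismatches
```

For example, `binary_dot(pack_signs([1, -1, 1, 1]), pack_signs([1, 1, -1, 1]), 4)` matches the ordinary dot product of the two sign vectors.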
Formal Verification of Integer Dividers: Division by a Constant
2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00022
Atif Yasin, Tiankai Su, S. Pillement, M. Ciesielski
Abstract: Division is one of the most complex and hard-to-verify arithmetic operations. While verification of major arithmetic operators, such as adders and multipliers, has significantly progressed in recent years, less attention has been devoted to formal verification of dividers. A type of divider that is often used in embedded systems is division by a constant. This paper presents a formal verification method for different divide-by-constant architectures and generic restoring dividers based on the computer algebra approach. Our experiments for different divider architectures and a comparison with exhaustive simulation demonstrate the effectiveness and scalability of the method.
Pages: 76-81
Citations: 6
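To make the divide-by-constant idea in the abstract above concrete, the sketch below uses one common software formulation: replace n // d by a multiplication with a precomputed constant followed by a right shift, then confirm the result by exhaustive simulation over a small bit width. This is an illustration under stated assumptions, not the paper's computer-algebra verification method or any of its divider architectures. The function names are hypothetical.

```python
def magic_for(d, n_bits):
    """Return (m, shift) so that (n * m) >> shift == n // d for all 0 <= n < 2**n_bits."""
    if d <= 0:
        raise ValueError("divisor must be positive")
    l = (d - 1).bit_length()       # ceil(log2(d))
    shift = n_bits + l
    m = -(-(1 << shift) // d)      # ceil(2**shift / d) using integer arithmetic
    return m, shift


def div_by_const(n, m, shift):
    """Divide by the constant encoded in (m, shift): one multiply and one shift."""
    return (n * m) >> shift


def exhaustive_check(d, n_bits):
    """Compare against n // d for every n_bits-wide unsigned input."""
    m, shift = magic_for(d, n_bits)
    return all(div_by_const(n, m, shift) == n // d for n in range(1 << n_bits))
```

Exhaustive checking takes 2**n_bits evaluations per divisor, which is practical at 8 or 16 bits but not at 64. That growth is why the abstract uses exhaustive simulation only as a point of comparison for scalability.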
Security in Many-Core SoCs Leveraged by Opaque Secure Zones
2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00091
L. L. Caimi, F. Moraes
Abstract: This paper presents an original approach to protect the execution of applications with security constraints in many-core systems. The proposed method includes three defense mechanisms. The first is application admission into the many-core using ECDH and MAC techniques. The second is the spatial reservation of computation and communication resources, resulting in an Opaque Secure Zone (OSZ). The key feature enabling the runtime creation of OSZs is a rerouting mechanism responsible for diverting any traffic traversing an OSZ. The last mechanism is access to peripherals using a secure protocol to open access points in the OSZ border, together with lightweight encryption mechanisms.
Pages: 471-476
Citations: 11
Title Page iii
2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/isvlsi.2019.00002
Citations: 0
Traffic Driven Automated Synthesis of Network-on-Chip from Physically Aware Behavioral Specification
2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00031
Anup Gangwar, Zheng Xu, N. Agarwal, Ravishankar Sreedharan, Ambica Prasad
Abstract: The process of laying out the various interconnect components and configuring them is termed interconnect synthesis. A Network-on-Chip (NoC) has various building blocks such as Routers, Resizers, Power and Clock domain converters (PCDCs), Pipeline elements, etc. A software tool is needed to connect these various components (topology) and then configure them (including routing) so that the user constraints are met and the overall logic and wiring cost, along with power, is kept low. In this paper we present a tool which generates Power, Performance and Area (PPA) optimized NoCs. The input is a behavioral specification which consists of a rough floor-plan, bridge parameters, available clock, power and voltage domains, address spaces, stochastic traffic (including classes and latency criticality), traffic dependency and any partial topology for the locked-down portions of the NoC. The output is an optimized NoC, with instantiation and placement of components (Routers, Resizers, etc.), Virtual Channel (VC) assignments, clock-domain assignments, routing, bridge parameter tuning, FIFO sizes, etc. Using this flow, we are able to generate NoCs which are within 15% of the hand-tuned designs (optimized over several months) for various metrics, and which exceed them on critical metrics by as much as 30%.
Pages: 122-127
Citations: 0
A Low-Complexity RS Decoder for Triple-Error-Correcting RS Codes
2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00094
Zengchao Yan, Jun Lin, Zhongfeng Wang
Abstract: Reed-Solomon (RS) codes have been widely used in digital communication and storage systems. The commonly used decoding algorithms include the Berlekamp-Massey (BM) algorithm and its variants such as the inversionless BM (iBM) and the Reformulated inversionless BM (RiBM). All these algorithms require computation-intensive procedures including the key equation solver (KES) and the Chien Search & Forney algorithm (CS&F). For RS codes with error correction ability t ≤ 2, it is known that error locations and magnitudes can be found through a direct equation solver. However, for RS codes with t = 3, no such work has been reported yet. In this paper, a low-complexity algorithm for triple-error-correcting RS codes is proposed. Moreover, an optimized architecture for the proposed algorithm is developed. For a (255, 239) RS code over GF(2^8), the synthesis results show that the area-efficiency of the proposed decoder is 217% higher than that of the conventional RiBM-based RS decoder in 4-parallel. As the degree of parallelism increases, the area-efficiency is increased to 364% in the 16-parallel architecture. The synthesis results show that the proposed decoder for the given example RS code can achieve a throughput as large as 124 Gb/s.
Pages: 489-494
Citations: 2
Fast-ABC: A Fast Architecture for Bottleneck-Like Based Convolutional Neural Networks
2019 IEEE Computer Society Annual Symposium on VLSI (ISVLSI) Pub Date : 2019-07-01 DOI: 10.1109/ISVLSI.2019.00010
Xiaoru Xie, Fangxuan Sun, Jun Lin, Zhongfeng Wang
Abstract: In recent years, efficient inference of neural networks has become one of the most popular research fields. In order to reduce the required number of computations and weights, many efforts have been made to construct lightweight networks (LWNs), where bottleneck-like operations (BLOs) have been widely adopted. However, most current hardware accelerators are not able to utilize the optimization space for BLOs. This paper first shows, via both theoretical analysis and experimental results, that the conventional computational flows employed by most existing accelerators incur an extremely low resource utilization ratio due to the extremely high DRAM bandwidth requirements in these LWNs. To address this issue, a partial fusion strategy which can drastically reduce the bandwidth requirement is proposed. Additionally, the Winograd algorithm is employed to further reduce the computational complexity. Based on these, an efficient accelerator for BLO-based networks called Fast Architecture for Bottleneck-like based Convolutional neural networks (Fast-ABC) is proposed. Fast-ABC is implemented on Altera Stratix V GSMD8, and can achieve a very high throughput of up to 137 fps and 264 fps on ResNet-18 and MobileNetV2, respectively. Implementation results show that the proposed architecture significantly improves the throughput on LWNs compared with prior art, at a much lower resource cost.
Pages: 1-6
Citations: 4
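The abstract above cites the Winograd algorithm as a way to reduce computational complexity. As a small worked instance of that technique, the sketch below implements the one-dimensional F(2,3) minimal filtering form, which produces two outputs of a 3-tap filter with four multiplications instead of six. It is a textbook illustration, not the Fast-ABC datapath, and the function names are hypothetical.

```python
def direct_f23(d, g):
    """Two outputs of a 3-tap filter over 4 inputs, computed directly (6 multiplications)."""
    return (
        d[0] * g[0] + d[1] * g[1] + d[2] * g[2],
        d[1] * g[0] + d[2] * g[1] + d[3] * g[2],
    )


def winograd_f23(d, g):
    """Same two outputs using the Winograd F(2,3) form (4 multiplications)."""
    # Filter-side terms depend only on g and can be precomputed once per filter.
    g_sum = (g[0] + g[1] + g[2]) / 2
    g_alt = (g[0] - g[1] + g[2]) / 2
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * g_sum
    m3 = (d[2] - d[1]) * g_alt
    m4 = (d[1] - d[3]) * g[2]
    return (m1 + m2 + m3, m2 - m3 - m4)
```

The saving comes at the cost of extra additions on the input side, which is usually a good trade in hardware where multipliers dominate area.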