{"title":"A Memory-Efficient Hardware Architecture for Deformable Convolutional Networks","authors":"Yue Yu, Jiapeng Luo, W. Mao, Zhongfeng Wang","doi":"10.1109/SiPS52927.2021.00033","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00033","url":null,"abstract":"In recent years, deformable convolutional networks are widely adopted in object detection tasks and have achieved outstanding performance. Compared with conventional convolution, the deformable convolution has an irregular receptive field to adapt to objects with different sizes and shapes. However, the irregularity of the receptive field causes inefficient access to memory and increases the complexity of control logic. Toward hardware-friendly implementation, prior works change the characteristics of deformable convolution by restricting the receptive field, leading to accuracy degradation. In this paper, we develop a dedicated Sampling Core to sample and rearrange the input pixels, enabling the convolution array to access the inputs regularly. In addition, a memory-efficient dataflow is introduced to match the processing speed of the Sampling Core and convolutional array, which improves hardware utilization and reduces access to off-chip memory. Based on these optimizations, we propose a novel hardware architecture for the deformable convolution network, which is the first work to accelerate the original deformable convolution network. With the design of the memory-efficient architecture, the access to the off-chip memory is reduced significantly. We implement it on Xilinx Virtex-7 FPGA, and experiments show that the energy efficiency reaches 50.29 GOPS/W, which is 2.5 times higher compared with executing the same network on GPU.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130240482","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploration of Energy-Efficient Architecture for Graph-Based Point-Cloud Deep Learning","authors":"Jie-Fang Zhang, Zhengya Zhang","doi":"10.1109/SiPS52927.2021.00054","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00054","url":null,"abstract":"Deep learning on point clouds has attracted increasing attention in the fields of 3D computer vision and robotics. In particular, graph-based point-cloud deep neural networks (DNNs) have demonstrated promising performance in 3D object classification and scene segmentation tasks. However, the scattered and irregular graph-structured data in a graph-based point-cloud DNN cannot be computed efficiently by existing SIMD architectures and accelerators. Following a review of the challenges of point-cloud DNN and the key edge convolution operation, we provide several directions in optimizing the processing architecture, including computation model, data reuse, and data locality, for achieving an effective acceleration and an improved energy efficiency.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130252331","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"[Copyright notice]","authors":"","doi":"10.1109/sips52927.2021.00003","DOIUrl":"https://doi.org/10.1109/sips52927.2021.00003","url":null,"abstract":"","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130420955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Scalable Generator for Massive MIMO Baseband Processing Systems with Beamspace Channel Estimation","authors":"Yue Dai, Harrison Liew, M. Rasekh, Seyed Hadi Mirfarshbafan, Alexandra Gallyas-Sanhueza, James Dunn, Upamanyu Madhow, Christoph Studer, B. Nikolić","doi":"10.1109/SiPS52927.2021.00040","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00040","url":null,"abstract":"This paper describes a scalable, highly portable, and energy-efficient generator for massive multiple-input multiple-output (MIMO) baseband processing systems. This generator is written in Chisel and produces hardware instances for a scalable massive MIMO system employing distributed processing. The generator is parameterized in both the MIMO system and hardware datapath elements. Coupled with a Python-based system simulator, the generator can be adapted to implement other baseband processing algorithms. To demonstrate the adaptability, several generator instances with different parameter values are evaluated by FPGA emulation. In addition, a beamspace calibration and channel denoising algorithm are applied to further improve the channel estimation performance. With those algorithms, the error vector magnitude can be reduced by up 9.2%.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"5 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130096691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Generation of Dynamic Inference Architecture for Deep Neural Networks","authors":"Shize Zhao, Liulu He, Xiaoru Xie, Jun Lin, Zhongfeng Wang","doi":"10.1109/SiPS52927.2021.00029","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00029","url":null,"abstract":"The computational cost of deep neural network(DNN) model can be reduced dramatically by applying different architectures based on the difficulties of each sample, which is named dynamic inference tech. Manually designed dynamic inference framework is hard to be optimal for the dependency on human experience, which is also time-consuming and labor-intensive. In this paper, we provide an auto-designed AB-Net based on the popular dynamic framework BranchyNet, which is inspired by neural architecture search (NAS). To further accelerate the search procedure, we also develop several specific techniques. Firstly, the search space is optimized by the pre-selection of candidate architectures. Then, a neighborhood greedy search algorithm is developed to efficiently find the optimal architecture in the improved search space. Moreover, our scheme can be extended to the multiple-branch cases to further enhance the performance of the AB-Net. We apply the AB-Net on multiple mainstream models and evaluate them on datasets CIFAR10/100. Compared to the handcrafted BranchyNet, the proposed AB-Net is able to achieve 1.57× computational cost reduction at least even with slight accuracy improvement on CIFAR100. Moreover, the AB-Net also significantly outperforms the S2DNAS on accuracy with similar cost reduction, which is the state-of-the-art automatic dynamic inference architecture.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129547743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Reconfigurable Neural Synaptic Plasticity-Based Stochastic Deep Neural Network Computing","authors":"Zihan Xia, Ya Dong, Jienan Chen, Rui Wan, Shuai Li, Tingyong Wu","doi":"10.1109/SiPS52927.2021.00048","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00048","url":null,"abstract":"With the increasing popularity of deep neural networks (DNNs), a large amount of research effort has been devoted to the hardware acceleration of DNNs to achieve efficient processing. Nevertheless, few works have explored the similarities between the biological essence of DNNs and arithmetic circuits. Moreover, stochastic computing (SC), which implements complex arithmetic operations with simple logic gates, has been applied to the acceleration of DNNs. However, traditional SC suffers from high latency and large hardware cost of pseudo-random number generators (PRNGs). Inspired by neural synaptic plasticity and SC, in this work, we present the reconfigurable neural synaptic plasticity-based computing (RNSP) to mimic the biological neuron behaviors and exploit the parallelism of SC to the full extent while maintaining a small hardware footprint compared to fixed-point counterparts. RNSP converts fixed-point numbers to parallel bits without logic resources, which are then synthesized by bit-wise multiplications and some full adders. In addition, we propose the arithmetic unit based on RNSP and use re-training to mitigate the accuracy degradation. Finally, a convolution engine (CE) built on RNSP with high memory bandwidth efficiency is designed. According to the implementation results on FPGA, the proposed RNSP-based CE outperforms the fixed-point counterpart in terms of power consumption and area.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"7 2-3","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114046739","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fully Convolutional Network-Based DOA Estimation with Acoustic Vector Sensor","authors":"Sifan Wang, J. Geng, Xin Lou","doi":"10.1109/SiPS52927.2021.00014","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00014","url":null,"abstract":"In this paper, a learning-based direction of arrival (DOA) estimation pipeline for acoustic vector sensor (AVS) is proposed. In the proposed pipeline, a fully convolutional network (FCN) is introduced for uncontaminated time-frequency (TF) point extraction, which is a crucial step for AVS-based DOA estimation. Unlike conventional direct path dominant (DPD) or single source points (SSP) detection, the uncontaminated TF point extraction problem is modeled as an image segmentation problem, where the direct DOA cues from the spatial response of AVS is utilized for ground truth labeling to generate the training data of the network. With the extracted uncontaminated TF points, the final DOA can be generated using the proposed fuzzy geometric median (FGM) clustering. Simulation results show that the proposed pipeline is capable of improving the accuracy in the cases of small angular difference between acoustic sources and improving robustness in strong reverberation and noise situations.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"292 4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131923444","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault-Tolerance of Binarized and Stochastic Computing-based Neural Networks","authors":"Amir Ardakani, A. Ardakani, W. Gross","doi":"10.1109/SiPS52927.2021.00018","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00018","url":null,"abstract":"Both binarized and stochastic computing-based neural networks exploit bit-wise operations to replace expensive full-precision multiplications with simple XNOR gates and thus, offer low-cost hardware implementation. In stochastic computing, arithmetic computations are performed on sequences of random bits which can approximate any real values. Stochastic computing-based neural networks benefit from approximate computing and promote fault-tolerant architectures against soft errors in noisy environments. On the other hand, in binarized neural networks, real values are deterministically binarized using the sign function. As a result, any bit-flip in the binarized values dramatically changes the outcome of arithmetic computations and makes binarized neural networks more vulnerable against soft errors. In this paper, we compare these two neural networks against each other in terms of fault-tolerance and hardware complexity (i.e., area and energy efficiency).","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116815597","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hartley Stochastic Computing For Convolutional Neural Networks","authors":"S. H. Mozafari, J. Clark, W. Gross, B. Meyer","doi":"10.1109/SiPS52927.2021.00049","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00049","url":null,"abstract":"Energy consumption and the latency of convolutional neural networks (CNNs) are two important factors that limit their applications specifically for embedded devices. Fourier-based frequency domain (FD) convolution is a promising low-cost alter-native to conventional implementations in the spatial domain (SD) for CNNs. FD convolution performs its operation with point-wise multiplications. However, in CNNs, the overhead for the Fourier-based FD-convolution surpasses its computational saving for small filter sizes. In this work, we propose to implement convolutional layers in the FD using the Hartley transformation (HT) instead of the Fourier transformation. We show that the HT can reduce the convolution delay and energy consumption even for small filters. With the HT of parameters, we replace convolution with point-wise multiplications. HT lets us compress input feature maps, in all convolutional layer, before convolving them with filters. To optimize the hardware implementation of our method, we utilize stochastic computing (SC) to perform the point-wise multiplications in the FD. In this regard, we re-formalize the HT to better match with SC. We show that, compared to conventional Fourier-based convolution, Hartley SC-based convolution can achieve 1.33x speedup, and 1.23x energy saving on a Virtex 7 FPGA when we implement AlexNet over CIFAR-10.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127675701","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Blind Detection Method and FPGA Implementation for Energy-Efficient Sidelink Communications","authors":"Chenhao Zhang, Haiqin Hu, Shan Cao, Zhiyuan Jiang","doi":"10.1109/SiPS52927.2021.00010","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00010","url":null,"abstract":"A novel physical sidelink control channel (PSCCH) blind detection method based on demodulation reference signal (DMRS) detection is proposed for sidelink communications in cellular vehicular-to-everything (C-V2X). In the proposed method, the user equipment (UE) first performs coherent energy detection on the DMRS positions. According to the information of the time/frequency location where the DMRS is detected, the UE can adjust the decoding area to minimize unnecessary blind decoding attempts. Based on the proposed algorithm and the channel estimation method, a VLSI architecture of joint energy detection and channel estimation (JEC) is proposed. Reference implementation results for a Xilinx Virtex-7 FPGA show that our design can reduce hardware complexity and energy consumption.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131138738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}