TernGEMM: GEneral Matrix Multiply Library with Ternary Weights for Fast DNN Inference
Seokhyeon Choi, Kyuhong Shim, Jungwook Choi, Wonyong Sung, B. Shim
2021 IEEE Workshop on Signal Processing Systems (SiPS), October 2021. DOI: 10.1109/SiPS52927.2021.00028

Abstract: Efficient implementation of deep neural networks (DNNs) on CPU-based systems is critical as applications spread to embedded and Internet of Things (IoT) devices. Many CPUs for personal computers and embedded systems provide Single Instruction Multiple Data (SIMD) instructions, which can be used to implement the efficient GEneral Matrix Multiply (GEMM) libraries that DNN inference depends on. Although many DNNs perform well even at 1-bit or 2-bit precision, current CPU instructions and libraries do not efficiently support arithmetic below 8 bits. We propose TernGEMM, a GEMM library built on SIMD instructions for DNN inference with ternary weights and sub-8-bit activations. TernGEMM improves speed by replacing slow multiply-add instructions with logical operations and by accumulating many products without bit-expansion operations. We compared TernGEMM's speedup against a tiling-optimized baseline and against GEMMLowp, an 8-bit precision GEMM library. On an Intel CPU, TernGEMM achieves speedups of ×2.052, ×2.973, and ×2.986 on ResNet-50, MobileNet-V2, and EfficientNet-B0, respectively; on an ARM CPU, the corresponding speedups are ×2.143, ×1.765, and ×1.856.
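The central trick in the TernGEMM abstract, replacing multiply-add with logical operations, can be illustrated with a minimal sketch (this is not the library's actual SIMD kernel): with binary activations and ternary weights in {-1, 0, +1} packed as two bitmasks, a dot product collapses to AND plus popcount.

```python
def ternary_dot(act_bits: int, w_pos: int, w_neg: int) -> int:
    """Dot product of a binary activation vector (packed into an int,
    one bit per element) with a ternary weight vector packed as two
    bitmasks: w_pos marks +1 weights, w_neg marks -1 weights.
    Multiply-add reduces to AND + popcount, mirroring how SIMD logical
    operations can replace arithmetic for sub-8-bit operands."""
    return bin(act_bits & w_pos).count("1") - bin(act_bits & w_neg).count("1")

# Activations x = [1, 1, 0, 1] (LSB first), weights w = [+1, -1, 0, +1]:
acts = 0b1011    # bit i set <=> x[i] == 1
w_pos = 0b1001   # bits where w[i] == +1
w_neg = 0b0010   # bits where w[i] == -1
print(ternary_dot(acts, w_pos, w_neg))  # 1*1 + 1*(-1) + 0*0 + 1*1 = 1
```

Packing many elements per machine word is what lets one logical instruction stand in for many multiplies.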
Efficient Neuromorphic Signal Processing with Loihi 2
G. Orchard, E. P. Frady, D. B. Rubin, S. Sanborn, S. Shrestha, F. Sommer, Mike Davies
2021 IEEE Workshop on Signal Processing Systems (SiPS), October 2021. DOI: 10.1109/SiPS52927.2021.00053

Abstract: The biologically inspired spiking neurons used in neuromorphic computing are nonlinear filters with dynamic state variables, very different from the stateless neuron models used in deep learning. The next version of Intel's neuromorphic research processor, Loihi 2, supports a wide range of stateful spiking neuron models with fully programmable dynamics. Here we showcase advanced spiking neuron models that can efficiently process streaming data, in simulation experiments on emulated Loihi 2 hardware. In one example, Resonate-and-Fire (RF) neurons are used to compute the Short-Time Fourier Transform (STFT) with similar computational complexity but 47× less output bandwidth than the conventional STFT. In another example, we describe an algorithm for optical flow estimation using spatiotemporal RF neurons that requires over 90× fewer operations than a conventional DNN-based solution. We also demonstrate promising preliminary results using backpropagation to train RF neurons for audio classification tasks. Finally, we show that a cascade of Hopf resonators, a variant of the RF neuron, replicates novel properties of the cochlea and motivates an efficient spike-based spectrogram encoder.
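The Resonate-and-Fire neuron behind the spike-based STFT above keeps a complex, oscillating membrane state, so each neuron acts as a damped resonator tuned to one frequency. A toy discrete-time version, purely illustrative (Loihi 2's actual neuron model and fixed-point details differ):

```python
import cmath
import math

def rf_response(inputs, omega, decay=0.02, dt=1.0):
    """Drive an RF-style complex state z with an input stream:
    z[t+1] = exp((-decay + 1j*omega) * dt) * z[t] + x[t].
    |z| grows when the input contains energy near omega."""
    coeff = cmath.exp((-decay + 1j * omega) * dt)
    z = 0j
    for x in inputs:
        z = coeff * z + x
    return abs(z)

# A 64-sample sinusoid at 0.3 rad/sample excites the matched resonator
# far more strongly than a detuned one.
signal = [math.cos(0.3 * t) for t in range(64)]
tuned = rf_response(signal, omega=0.3)
detuned = rf_response(signal, omega=1.2)
print(tuned > 3 * detuned)  # True: the tuned neuron resonates
```

A bank of such neurons at different omegas yields a running spectral estimate, which is the sense in which RF neurons compute an STFT-like transform.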
{"title":"A Stage-wise Conversion Strategy for Low-Latency Deformable Spiking CNN","authors":"Chunyu Wang, Jiapeng Luo, Zhongfeng Wang","doi":"10.1109/SiPS52927.2021.00009","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00009","url":null,"abstract":"Spiking neural networks (SNNs) are currently one of the most successful approaches to model the behavior and learning potential of the brain. Recently, they have obtained marvelous research interest thanks to their event-driven and energy-efficient characteristics. While difficult to directly train SNNs from scratch because of their non-differentiable spike operations, many works have focused on converting a trained DNN to the target SNN. However, there is no efficient method to convert the deformable convolutional layer which is frequently used in many applications. The deformable convolution layer enables deformation of the convolutional sampling grid by adding offsets to the regular sampling locations, which enhances the geometric transformation modeling capability of CNNs. In this work, we propose a novel deformable spiking CNN, which can successfully convert DNNs with deformable convolution layers to SNNs with much shorter simulation time and have low latency during inference while maintaining high accuracy. To be specific, we design an effective method dedicated for deformable convolution layers to be converted. By treating the offset prediction module as an embedded SNN, we calculate the spiking offsets multi times and use the average values as the final offsets for deformable convolution. We also propose a stage-wise DNN-SNN conversion strategy to further reduce the conversion error. We divide the network into several stages and convert each stage sequentially with retraining to diminish the difference between the source DNN and the target SNN as much as possible. 
The experiments on CIFAR-10 and CIFAR-100 datasets show that our method surpasses the state-of-the-art works both in conversion accuracy and inference latency.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"267 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133333101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
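The offset-handling step described above, averaging the embedded SNN's spiking offset predictions over several timesteps, is simple enough to sketch directly (names and values here are illustrative, not from the paper):

```python
def average_offsets(spiking_offsets):
    """Average per-timestep offset predictions from the embedded SNN
    to obtain the final offsets fed to the deformable convolution."""
    steps = len(spiking_offsets)
    return [sum(vals) / steps for vals in zip(*spiking_offsets)]

# Offsets predicted at 4 simulation timesteps for two sampling locations:
per_step = [[0.8, -1.2], [1.2, -0.8], [1.0, -1.0], [1.0, -1.0]]
print(average_offsets(per_step))  # ~[1.0, -1.0]
```

Averaging smooths out the quantization noise that individual spiking passes introduce into the sampling grid.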
{"title":"Design and Implementation of a Highly Accurate Stochastic Spiking Neural Network","authors":"Chengcheng Tang, Jie Han","doi":"10.1109/SiPS52927.2021.00050","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00050","url":null,"abstract":"The emergence of spiking neural networks (SNNs) provide a promising approach to the energy efficient design of artificial neural networks (ANNs). The rate encoded computation in SNNs utilizes the number of spikes in a time window to encode the intensity of a signal, in a similar way to the information encoding in stochastic computing. Inspired by this similarity, this paper presents a hardware design of stochastic SNNs that attains a high accuracy. A design framework is elaborated for the input, hidden and output layers. This design takes advantage of a priority encoder to convert the spikes between layers of neurons into index-based signals and uses the cumulative distribution function of the signals for spike train generation. Thus, it mitigates the problem of a relatively low information density and reduces the usage of hardware resources in SNNs. This design is implemented in field programmable gate arrays (FPGAs) and its performance is evaluated on the MNIST image recognition dataset. Hardware costs are evaluated for different sizes of hidden layers in the stochastic SNNs and the recognition accuracy is obtained using different lengths of stochastic sequences. The results show that this stochastic SNN framework achieves a higher accuracy compared to other SNN designs and a comparable accuracy as their ANN counterparts. 
Hence, the proposed SNN design can be an effective alternative to achieving high accuracy in hardware constrained applications.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127077346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
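The CDF-based spike-train generation mentioned above can be illustrated in software (a stand-alone sketch; the FPGA design uses a priority encoder and fixed-point hardware rather than floating-point sampling):

```python
import bisect
import random

def sample_indices_from_cdf(probs, n, rng):
    """Draw n spike indices from a categorical distribution by
    inverting its cumulative distribution function, mimicking
    index-based spike-train generation."""
    cdf, acc = [], 0.0
    for p in probs:
        acc += p
        cdf.append(acc)
    return [bisect.bisect_left(cdf, rng.random()) for _ in range(n)]

rng = random.Random(0)
idx = sample_indices_from_cdf([0.2, 0.5, 0.3], 10000, rng)
rates = [idx.count(i) / len(idx) for i in range(3)]
print(rates)  # approximately [0.2, 0.5, 0.3]
```

Counting spikes per index over the window recovers the encoded intensities, the same rate-decoding principle the paper shares with stochastic computing.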
{"title":"Scalable Hardware Architecture for Invertible Logic with Sparse Hamiltonian Matrices","authors":"N. Onizawa, A. Tamakoshi, T. Hanyu","doi":"10.1109/SiPS52927.2021.00047","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00047","url":null,"abstract":"We introduce a scalable hardware architecture for large-scale invertible logic. Invertible logic has been recently presented that can realize bidirectional computing probabilis-tically based on Hamiltonians with a small number of non-zero elements. In order to store and compute the Hamiltonians efficiently in hardware, a sparse matrix representation of PTELL (partitioned and transposed ELLPACK) is proposed. A memory size of PTELL can be smaller than that of a conventional ELL by reducing the number of paddings while parallel reading of non-zero values are realized for high-throughput operations. As a result, the proposed scalable invertible-logic hardware based on PTELL is designed on Xilinx KC705 FPGA board, which achieves two orders of magnitude faster than an 8-core CPU implementation.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126098335","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ComplexBeat: Breathing Rate Estimation from Complex CSI
Sitian Li, Andreas Toftegaard Kristensen, A. Burg, Alexios Balatsoukas-Stimming
2021 IEEE Workshop on Signal Processing Systems (SiPS), October 2021. DOI: 10.1109/SiPS52927.2021.00046

Abstract: In this paper, we explore the use of channel state information (CSI) from a WiFi system to estimate the breathing rate of a person in a room. To extract the WiFi CSI components that are sensitive to breathing, we propose working with the delay-domain channel impulse response (CIR), whereas most state-of-the-art methods use its frequency-domain representation. One obstacle in processing CSI data is that its amplitude and phase are highly distorted by measurement uncertainties. We therefore also propose an amplitude calibration method and a phase-offset calibration method for CSI measured in orthogonal frequency-division multiplexing (OFDM) multiple-input multiple-output (MIMO) systems. Finally, we implement a complete breathing rate estimation system to showcase the effectiveness of the proposed calibration and CSI extraction methods.
{"title":"Understanding the Energy vs. Adversarial Robustness Trade-Off in Deep Neural Networks","authors":"Kyungmi Lee, A. Chandrakasan","doi":"10.1109/SiPS52927.2021.00017","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00017","url":null,"abstract":"Adversarial examples, which are crafted by adding small inconspicuous perturbations to typical inputs in order to fool the prediction of a deep neural network (DNN), can pose a threat to security-critical applications, and robustness against adversarial examples is becoming an important factor for designing a DNN. In this work, we first examine the methodology for evaluating adversarial robustness that uses the first-order attack methods, and analyze three cases when this evaluation methodology overestimates robustness: 1) numerical saturation of cross-entropy loss, 2) non-differentiable functions in DNNs, and 3) ineffective initialization of the attack methods. For each case, we propose compensation methods that can be easily combined with the existing attack methods, thus provide a more precise evaluation methodology for robustness. Second, we benchmark the relationship between adversarial robustness and inference-time energy at an embedded hardware platform using our proposed evaluation methodology, and demonstrate that this relationship can be obscured by the three cases behind overestimation. 
Overall, our work shows that the robustness-energy trade-off has differences from the conventional accuracy-energy trade-off, and highlights importance of the precise evaluation methodology for robustness.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134300868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
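The first overestimation case, numerical saturation of the cross-entropy loss, is easy to reproduce: once the network is extremely confident, the cross-entropy gradient underflows to zero and a first-order attack receives no signal, while a logit-margin loss remains informative. A two-class illustration in the spirit of such compensations (not the paper's exact formulation):

```python
import math

def ce_grad_wrt_logits(logits, label):
    """Gradient of cross-entropy w.r.t. the logits: softmax(z) - onehot."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s - (1.0 if i == label else 0.0)
            for i, e in enumerate(exps)]

def margin_loss(logits, label):
    """CW-style logit margin: max_other(z) - z_label. Unlike saturated
    cross-entropy, it still changes as the logits change."""
    other = max(z for i, z in enumerate(logits) if i != label)
    return other - logits[label]

confident = [100.0, 0.0]                  # extremely confident class-0 logits
g = ce_grad_wrt_logits(confident, 0)
print(max(abs(v) for v in g) < 1e-30)     # True: the CE gradient has vanished
print(margin_loss(confident, 0))          # -100.0: the margin still gives signal
```

An attack that sees only the vanished gradient stalls and reports spurious robustness, which is precisely the overestimation the paper warns about.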
{"title":"Design and Implementation of Autoencoder-LSTM Accelerator for Edge Outlier Detection","authors":"Nadya A. Mohamed, Joseph R. Cavallaro","doi":"10.1109/SiPS52927.2021.00032","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00032","url":null,"abstract":"Sensors are used to monitor various parameters in many real-world applications. Sudden changes in the underlying patterns of the sensors readings may represent events of interest. Therefore, event detection, an important temporal version of outlier detection, is one of the primary motivating applications in sensor networks. This work describes the implementation of a real-time outlier detection that uses an Autoencoder-LSTM neural-network accelerator implemented on the Xilinx PYNQ-Z1 development board. The implemented accelerator consists of a fine-tuned Autoencoder to extract the latent features in sensor data followed by a Long short-term memory (LSTM) network to predict the next step and detect outliers in real-time. The implemented design achieves 2.06 ms minimum latency and 85.9 GOp/s maximum throughput. The low latency and 0.25 W power consumption of the Autoencoder-LSTM outlier detector makes it suitable for resource-constrained computing platforms.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"91 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125589085","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OneAI - Novel Multipurpose Deep Learning Algorithms for UWB Wireless Networks","authors":"A. Abbasi, Huaping Liu","doi":"10.1109/SiPS52927.2021.00031","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00031","url":null,"abstract":"In this paper, novel multipurpose deep learning algorithms are proposed for ultra-wideband (UWB) wireless networks that are capable of identifying the channel environment, estimating the SNR level, and performing ToA estimation, simultaneously. UWB technology is among the rapid-growing solutions for the next generation of deep learning-based wireless communication and localization systems. Existing deep learning algorithms for UWB wireless networks have addressed the various signal processing tasks individually in separate deep learning modules. This, however, increases the computational complexity, power consumption, and overall latency of the models. In this paper, unlike the existing methods, the desired signal processing tasks are performed in one single deep learning module. The proposed model consists of a main deep learning module as the core of the model that extracts low-level information from the signal and several shallow learning networks to extract high-level information. We demonstrate that the low-level information that is extracted in the core deep learning module can be reused in all separate tasks. 
The performance of the proposed models is investigated against the standard IEEE 802.15.4a channel model by evaluating various metrics such as accuracy, area under the curve (AUC), precision, mean absolute error (MAE), and mean square error (MSE).","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129967532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
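The architectural idea above, one shared backbone feeding several shallow task heads, can be sketched in a few lines. Everything here is illustrative: the feature choices and head rules are toys, not the paper's networks.

```python
def backbone(signal):
    """Shared low-level feature extraction, computed once per signal
    and reused by every task head."""
    mean = sum(signal) / len(signal)
    power = sum(x * x for x in signal) / len(signal)
    peak = max(abs(x) for x in signal)
    return (mean, power, peak)

# Shallow task-specific heads reusing the same features:
def env_head(feat):  # toy channel-environment classifier
    return "LOS" if feat[2] > 2.0 else "NLOS"

def snr_head(feat):  # toy SNR proxy from signal power
    return 10.0 * feat[1]

def toa_head(feat):  # toy ToA proxy from peak prominence
    return feat[2] - feat[0]

sig = [0.1, 0.2, 3.0, 0.1]
feat = backbone(sig)                         # extracted once
print(env_head(feat), snr_head(feat), toa_head(feat))
```

The saving the paper targets comes from running the expensive shared extraction once instead of once per task.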
{"title":"Compressive Estimation of Wideband mmW Channel using Analog True-Time-Delay Array","authors":"Veljko Boljanovic, D. Cabric","doi":"10.1109/SiPS52927.2021.00038","DOIUrl":"https://doi.org/10.1109/SiPS52927.2021.00038","url":null,"abstract":"High-rate directional communication in millimeterwave (mmW) systems requires a fast and accurate channel estimation. Novel array architectures and signal processing techniques are needed to avoid prohibitive estimation overhead associated with large antenna arrays. Recent advancements in hardware design helped the re-emergence of true-time-delay (TTD) arrays whose frequency-dependent beams can be leveraged for low-overhead channel probing and estimation. In this work, we consider an analog TTD array and develop a low-overhead compressive sensing based algorithm for channel estimation in frequency-domain. The algorithm is compared with related state-of-the-art approaches designed for analog phased antenna arrays. Our results reveal the advantages of the proposed TTD-based algorithm in terms of the required number of training symbols, estimation accuracy, and computational complexity.","PeriodicalId":103894,"journal":{"name":"2021 IEEE Workshop on Signal Processing Systems (SiPS)","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121333296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}