Vishesh Mishra, Sparsh Mittal, Rekha Singhal, M. Nambiar
{"title":"Novel, Configurable Approximate Floating-point Multipliers for Error-Resilient Applications","authors":"Vishesh Mishra, Sparsh Mittal, Rekha Singhal, M. Nambiar","doi":"10.1109/ISQED57927.2023.10129296","DOIUrl":"https://doi.org/10.1109/ISQED57927.2023.10129296","url":null,"abstract":"By exploiting the gap between the user’s accuracy requirement and the hardware’s accuracy capability, approximate circuit design offers enormous gains in efficiency for a minor accuracy loss. In this paper, we propose two approximate floating point multipliers (AxFPMs), named DTCL (decomposition, truncation and chunk-level leading-one quantization) and TDIL (truncation, decomposition and ignoring LSBs). Both AxFPMs introduce approximation in mantissa multiplication. DTCL works by rounding and truncating LSBs and quantizing each chunk. TDIL works by truncating LSBs and ignoring the least important terms in the multiplication. Further, both techniques multiply more significant terms by simply exponent addition or shift-and-add operations. These AxFPMs are configurable and allow trading off accuracy with hardware overhead. Compared to exact floating-point multiplier (FPM), DTCL(4,8,8) reduces area, energy and delay by 11.0%, 69% and 61%, respectively, while incurring a mean relative error of only 2.37%. On a range of approximate applications from machine learning, deep learning and image processing domains, our AxFPMs greatly improve efficiency with only minor loss in accuracy. For example, for image sharpening and Gaussian smoothing, all DTCL and TDIL variants achieve a PSNR of more than 30dB. The source-code is available at https://github.com/CandleLabAI/ApproxFloatingPointMultiplier.","PeriodicalId":315053,"journal":{"name":"2023 24th International Symposium on Quality Electronic Design (ISQED)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125648148","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of Pattern-dependent Rapid Thermal Annealing Effects on SRAM Design","authors":"Vidya A. Chhabria, S. Sapatnekar","doi":"10.1109/ISQED57927.2023.10129399","DOIUrl":"https://doi.org/10.1109/ISQED57927.2023.10129399","url":null,"abstract":"Rapid thermal annealing (RTA) is an important step in semiconductor manufacturing. RTA-induced variability due to differences in die layout patterns can significantly contribute to transistor parameter variations, resulting in degraded chip performance and yield. The die layout patterns that drive these variations are related to the distribution of the density of transistors (silicon) and shallow trench isolation (silicon dioxide) across the die, which result in emissivity variations that change the die surface temperature during annealing. While prior art has developed pattern-dependent simulators and provided mitigation techniques for digital design, it has failed to consider the impact of the temperature-dependent thermal conductivity of silicon on RTA effects and has not analyzed the effects on memory. This work develops a novel 3D transient pattern-dependent RTA simulation methodology that accounts for the dependence of the thermal conductivity of silicon on temperature. The simulator is used to both analyze the effects of RTA on memory performance and to propose mitigation strategies for a 7nm FinFET SRAM design. It is shown that RTA effects degrade read and write delays by 16% and 20% and read static noise margin (SNM) by 15%, and the applied mitigation strategies can compensate for these degradations at the cost of a 16% increase in area for a 7.5% tolerance in SNM margin.","PeriodicalId":315053,"journal":{"name":"2023 24th International Symposium on Quality Electronic Design (ISQED)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124643030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. PrashanthH., S. SrinikethS., Shrikrishna Hebbar, R. Chinmaye, M. Rao
{"title":"SQRTLIB : Library of Hardware Square Root Designs","authors":"C. PrashanthH., S. SrinikethS., Shrikrishna Hebbar, R. Chinmaye, M. Rao","doi":"10.1109/ISQED57927.2023.10129377","DOIUrl":"https://doi.org/10.1109/ISQED57927.2023.10129377","url":null,"abstract":"Square-root is an elementary arithmetic function that is utilized not only for image and signal processing applications but also to extract vector functionalities. The square-root module demands high energy and hardware resources, apart from being a complex design to implement. In the past, many techniques, including Iterative, New Non-Restoring (New-NR), CORDIC, Piece-wise-linear (PWL) approximation, Look-Up-Tables (LUTs), Digit-by-digit based integer (Digit-Int) format and fixed-point (Digit-FP) format implementations were reported to realize square-root function. Cartesian genetic programming (CGP) is a type of evolutionary algorithm that is suggested to evolve circuits by exploring a large solution space. This paper attempts to develop a library of square-root circuits ranging from 2-bits to 8-bits and also benchmark the proposed CGP evolved square-root circuits with the other hardware implementations. All designs were analyzed using both FPGA and ASIC (130 nm Skywater node) flow to characterize hardware parameters and evaluated using various error metrics. Among all the implementations, CGP-derived square-root designs of fixed-point format offered the best trade-off between hardware and error characteristics. All novel designs of this work are made freely available in [1] for further research and development usage.","PeriodicalId":315053,"journal":{"name":"2023 24th International Symposium on Quality Electronic Design (ISQED)","volume":"143 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124919717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sathwika Bavikadi, Purab Ranjan Sutradhar, A. Ganguly, Sai Manoj Pudukotai Dinakarrao
{"title":"Heterogeneous Multi-Functional Look-Up-Table-based Processing-in-Memory Architecture for Deep Learning Acceleration","authors":"Sathwika Bavikadi, Purab Ranjan Sutradhar, A. Ganguly, Sai Manoj Pudukotai Dinakarrao","doi":"10.1109/ISQED57927.2023.10129338","DOIUrl":"https://doi.org/10.1109/ISQED57927.2023.10129338","url":null,"abstract":"Emerging applications including deep neural networks (DNNs) and convolutional neural networks (CNNs) employ massive amounts of data to perform computations and data analysis. Such applications often lead to resource constraints and impose large overheads in data movement between memory and compute units. Several architectures such as Processing-in-Memory (PIM) are introduced to alleviate the bandwidth bottlenecks and inefficiency of traditional computing architectures. However, the existing PIM architectures represent a trade-off between power, performance, area, energy efficiency, and programmability. To better achieve the energy-efficiency and flexibility criteria simultaneously in hardware accelerators, we introduce a multi-functional look-up-table (LUT)-based reconfigurable PIM architecture in this work. The proposed architecture is a many-core architecture, each core comprises processing elements (PEs), a stand-alone processor with programmable functional units built using high-speed reconfigurable LUTs. The proposed LUTs can perform various operations, including convolutional, pooling, and activation that are required for CNN acceleration. Additionally, the proposed LUTs are capable of providing multiple outputs relating to different functionalities simultaneously without the need to design different LUTs for different functionalities. This leads to optimized area and power overheads. Furthermore, we also design special-function LUTs, which can provide simultaneous outputs for multiplication and accumulation as well as special activation functions such as hyperbolics and sigmoids. We have evaluated various CNNs such as LeNet, AlexNet, and ResNet18,34,50. Our experimental results have demonstrated that when AlexNet is implemented on the proposed architecture shows a maximum of 200× higher energy efficiency and 1.5× higher throughput than a DRAM-based LUT-based PIM architecture.","PeriodicalId":315053,"journal":{"name":"2023 24th International Symposium on Quality Electronic Design (ISQED)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124787951","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Training from Streaming Data with Concept Drift on FPGAs","authors":"Esther Roorda, S. Wilton","doi":"10.1109/ISQED57927.2023.10129312","DOIUrl":"https://doi.org/10.1109/ISQED57927.2023.10129312","url":null,"abstract":"In dynamic environments, the inputs to machine learning models may exhibit statistical changes over time, through what is called concept drift. Incremental training can allow machine learning models to adapt to changing conditions and maintain high accuracy by continuously updating network parameters. In the context of FPGA-based accelerators however, online incremental learning is challenging due to resource and communication constraints, as well as the absence of labelled training data. These challenges have not been fully evaluated or addressed in existing research. In this paper, we present and evaluate strategies for performing incremental training on streaming data with concept drift on FPGA-based platforms. We first present FPGA-based implementations of existing training algorithms to demonstrate the viability of online training with concept shift and to evaluate design tradeoffs. We then propose a technique for online training without labelled data and demonstrate its potential in the context of FPGA-based hardware acceleration.","PeriodicalId":315053,"journal":{"name":"2023 24th International Symposium on Quality Electronic Design (ISQED)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130706785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enlarging Reliable Pairs via Inter-Distance Offset for a PUF Entropy-Boosting Algorithm","authors":"Md. Omar Faruque, Wenjie Che","doi":"10.1109/ISQED57927.2023.10129308","DOIUrl":"https://doi.org/10.1109/ISQED57927.2023.10129308","url":null,"abstract":"Physically Unclonable Functions (PUFs) are emerging hardware security primitives that leverage random variations during chip fabrication to generate unique secrets. The amount of random secrets that can be extracted from a limited number of physical PUF components can be measured by entropy bits. Existing strategies of pairing or grouping N RO-PUF elements have an entropy upper bound limited by log2(N!) or O(N•log2(N)). A recently proposed entropy boosting technique [9] improves the entropy bits to be quadratically large at N(N-1)/2 or O(N^2), significantly improved the RO-PUF hardware utilization efficiency in generating secrets. However, the improved amount of random secrets comes at the cost of discarding a large portion of unreliable bits. In this paper, we propose an \"Inter-Distance Offset (IDO)\" technique that converts those unreliable pairs to be reliable by adjusting the pair inter-distance to an appropriate range. Theoretical analysis of the ratio of converted unreliable bits is provided along with experimental validations. Experimental evaluations on reliability, Entropy and reliability tradeoffs are given using real RO PUF datasets in [10]. Information leakage is analyzed and evaluated using PUF datasets to identify those offset ranges that leak no information. The proposed technique improves the portion of reliable (quadratically large) entropy bits by 20% and 100% respectively for different offset ranges. Hardware implementation on FPGAs demonstrates that the proposed technique is lightweight in implementation and runtime.","PeriodicalId":315053,"journal":{"name":"2023 24th International Symposium on Quality Electronic Design (ISQED)","volume":"92 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122029368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Simon Friedrich, Shambhavi Balamuthu Sampath, R. Wittig, M. Vemparala, Nael Fasfous, E. Matús, W. Stechele, G. Fettweis
{"title":"Lightweight Instruction Set for Flexible Dilated Convolutions and Mixed-Precision Operands","authors":"Simon Friedrich, Shambhavi Balamuthu Sampath, R. Wittig, M. Vemparala, Nael Fasfous, E. Matús, W. Stechele, G. Fettweis","doi":"10.1109/ISQED57927.2023.10129341","DOIUrl":"https://doi.org/10.1109/ISQED57927.2023.10129341","url":null,"abstract":"Modern deep neural networks specialized for object detection and semantic segmentation require specific operations to increase or preserve the resolution of their feature maps. Hence, more generic convolution layers called transposed and dilated convolutions are employed, adding a large number of zeros between the elements of the input features or weights. Usually, standard neural network hardware accelerators process these convolutions in a straightforward manner, without paying attention to the added zeros, resulting in an increased computation time. To cope with this problem, recent works propose to skip the redundant elements with additional hardware or solve the problem efficiently only for a limited range of dilation rates. We present a general approach for accelerating transposed and dilated convolutions that does not introduce any hardware overhead while supporting all dilation rates. To achieve this, we introduce a novel precision-scalable lightweight instruction set and memory scheme that can be applied to the different convolution variants. This results in a speed-up of 5 times in DeepLabV3+ outperforming the recently proposed design methods. The support of precision-scalable execution of all workloads further increases the speedup in computation time shown for the PointPillars, DeepLabV3+, and ENet networks. Compared to the state-of-the-art commercial EdgeTPU, the instruction footprint of ResNet-50 of our designed accelerator is reduced by 60 percent.","PeriodicalId":315053,"journal":{"name":"2023 24th International Symposium on Quality Electronic Design (ISQED)","volume":" 33","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120832453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DSEAdd: FPGA based Design Space Exploration for Approximate Adders with Variable Bit-precision","authors":"Archie Mishra, N. Rao","doi":"10.1109/ISQED57927.2023.10129364","DOIUrl":"https://doi.org/10.1109/ISQED57927.2023.10129364","url":null,"abstract":"Functional approximation methods have been used to exploit the inherent error tolerance of several applications. Approximate computing reduces the resources utilized at the cost of acceptable accuracy loss. Designers need to follow a systematic approach to arrive at an optimized design configuration based on certain constraints. In this work, we present DSEAdd: an FPGA-based automated design space exploration (DSE) framework targeting variable bit-width approximate adders. Given a certain area, timing or accuracy (ATA) constraint, the approach helps to identify the best adder configuration. We introduce a metric known as Figure of Merit (FOM) to quantify the area, performance and accuracy of the design. We test the DSE framework by running a set of 74 design configurations. We demonstrate the use of FOM as a metric to choose the best adder configuration. We observe that we can obtain an area-optimized design with a 9.7% reduction in resource usage at the cost of only 0.3% accuracy, but with a lower bit precision (8-bit instead of 32-bits). Further, at low bit precisions, a slight compromise in the area (0.35%) can help improve the accuracy dramatically (17.7%). To achieve the best trade-off between accuracy and resources, we propose a configuration with 2 or 3 sub-adders. Lastly, we note that a performance-optimized design is difficult to achieve at higher bit-precision.","PeriodicalId":315053,"journal":{"name":"2023 24th International Symposium on Quality Electronic Design (ISQED)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128778345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ISQED 2023 Best Papers","authors":"","doi":"10.1109/isqed57927.2023.10129392","DOIUrl":"https://doi.org/10.1109/isqed57927.2023.10129392","url":null,"abstract":"","PeriodicalId":315053,"journal":{"name":"2023 24th International Symposium on Quality Electronic Design (ISQED)","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-04-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128173383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}