{"title":"Energy-Efficient Radix-4 Belief Propagation Polar Code Decoding Using an Efficient Sign-Magnitude Adder and Clock Gating","authors":"O. Meteer, Arvid B. Van Den Brink, M. Bekooij","doi":"10.1109/DSD57027.2022.00026","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00026","url":null,"abstract":"Polar encoding is the first information coding method that has been proven to achieve channel capacity for binary-input discrete memoryless channels. Since its introduction, much research has been done on improving decoding performance, execution time and energy efficiency. Classic belief propagation uses radix-2 decoding, but a recent study proposed radix-4 decoding which reduces memory usage by 50%. However a drawback is its higher computational complexity, negatively impacting energy usage and throughput. In this paper we present an energy-efficient radix-4 belief propagation polar decoder architecture that uses a new sign-magnitude adder that does not require conversion to two's complement and back. On top of that we also propose using clock gating of input values by checking if all $R$ inputs of the decoder are zero. These two key contributions lead to a more energy -efficient design that is smaller and has higher maximum clock speed and throughput. Post-layout simulation results show that compared to the previously proposed 1024-bit radix-4 belief propagation polar code decoder, our decoder uses between 30.22 % and 32.80 % less power and is 5.2 % smaller at the same clock speed. Also, our design can achieve a 15.7% higher clock speed at which it is still up to 10.76% more power efficient and 4.8% smaller.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114168466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Majority-based Approximate Adder for FPGAs","authors":"B. Ghavami, Mahdi Sajedi, Mohsen Raji, Zhenman Fang, Lesley Shannon","doi":"10.1109/DSD57027.2022.00017","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00017","url":null,"abstract":"The most advanced ASIC-based approximate adders are focused on gate or transistor level approximating structures. However, due to architectural differences between ASIC and FPGA, comparable performance gains for FPGA-based approximate adders cannot be obtained using ASIC-based approximation ones. In this paper, we propose a method for designing a low-error approximate adder that effectively deploys the modern FPGA structure. We introduce an FPGA-based approximate adder, named as Majority Approximate Adder (MAA), with less error than the advanced approximate adders. MAA is constructed using an approximate part and an accurate one; i.e. the accurate part is based on a smaller carry-chain compared with the carry-chain of the corresponding accurate adder. In addition, approximate part is designed to use FPGA resources efficiently with a low mean error distance (MED). Experimental results based on Monte-Carlo simulation demonstrates that a 16-bit MAA has a 49.92% lower MED than the state of the art FPGA-based approximate adder. MAA also takes up less area and consumes less power than other FPGA-based approximate adders in the literature.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114908853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PositIV:A Configurable Posit Processor Architecture for Image and Video Processing","authors":"Akshat Ramachandran, John L. Gustafson, Anusua Roy, R. Ansari, R. Daruwala","doi":"10.1109/DSD57027.2022.00022","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00022","url":null,"abstract":"Image processing is essential for applications such as robot vision, remote sensing, computational photography, augmented reality etc. In the design of dedicated hardware for such applications, IEEE Std 754™ floating point (float) arithmetic units have been widely used. While float-based architectures have achieved favorable results, their hardware is complicated and requires a large silicon footprint. In this paper we propose a Posit-based Image and Video processor (PositIV), a completely pipelined, configurable, image processor using posit arithmetic that guarantees lower power use and smaller silicon footprint than floats. PositIV is able to effectively overlap computation with memory access and supports multidimensional addressing, virtual border handling, prefetching and buffering. It is successfully able to integrate configurability, flexibility, and ease of development with real-time performance characteristics. The performance of PositIV is validated on several image processing algorithms for different configurations and compared against state-of-the-art implementations. Additionally, we empirically demonstrate the superiority of posits in processing images for several conventional algorithms, achieving at least 35–40% improvement in image quality over standard floats.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114882306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Technology Mapping for PAIG Optimised Polymorphic Circuits","authors":"R. Ruzicka, Václav Simek","doi":"10.1109/DSD57027.2022.00112","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00112","url":null,"abstract":"The concept of polymorphic electronics allows to efficiently implement two or more functions in a single circuit. It is characteristic of that approach that the currently selected function from the set of available ones depends on the state of the circuit operating environment. The key components of such circuits are polymorphic gates. Since the introduction of polymorphic electronics, just a few tens of polymorphic gates have been published. However, a large number of them exhibit parameters that fall behind ubiquitous CMOS technology, which makes their utilization for real applications rather difficult. As it turns out, the synthesis of polymorphic circuits achieves a significantly higher degree of complexity in comparison to the ordinary digital circuit. In past, many of the previously reported polymorphic circuits were designed using evolutionary principles (EA, CGP, etc.). It has been shown that the problem of scalable synthesis techniques suitable for large-scale polymorphic circuits could be addressed by the adoption of multi-level synthesis techniques such as And-Inverter-Graphs. The PAIG (Polymorphic And-Inverter-Graphs) concept and synthesis techniques based on it seem to be a promising approach. This paper shows how modern polymorphic gates could be used in combination with a PAIG-based synthesis tool to obtain an efficient implementation of complex polymorphic circuits.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"28 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125451251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Modular Polynomial Multiplier for NTT Accelerator of Crystals-Kyber","authors":"Yuma Itabashi, Rei Ueno, N. Homma","doi":"10.1109/DSD57027.2022.00076","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00076","url":null,"abstract":"This paper presents a hardware design that efficiently performs the number theoretic transform (NTT) for lattice-based cryptography. First, we propose an efficient modular multiplication method for lattice-based cryptography defined over Proth numbers. The proposed method is based on a K-RED technique specific to Proth numbers. In particular, we divide the intermediate result into the sign bit and the other absolute value bits and handle them separately to significantly reduce implementation costs. Then, we show a butterfly unit datapath of NTT and inverse INTT equipped with the proposed modular multiplier. We apply the proposed NTT accelerator to Crystals-Kyber, which is lattice-based cryptography, and evaluate its performance on Xilinx Artix-7. The results show that the proposed NTT accelerators achieve up-to 3% and 33% higher area-time efficiency in terms of LUTs and FFs, respectively, than conventional best methods. In addition, the low-latency version of the proposed NTT accelerators achieves a 18% lower-latency with an area-time efficiency (in terms of LUTs, FFs, and DSPs) than the existing fastest method.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"355 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125640435","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CaW-NAS: Compression Aware Neural Architecture Search","authors":"Hadjer Benmeziane, Hamza Ouranoughi, S. Niar, Kaoutar El Maghraoui","doi":"10.1109/DSD57027.2022.00059","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00059","url":null,"abstract":"With the ever-growing demand for deep learning (DL) at the edge, building small and efficient DL architectures has become a significant challenge. Optimization techniques such as quantization, pruning or hardware-aware neural architecture search (HW-NAS) have been proposed. In this paper, we present an efficient HW-NAS; Compression-Aware Neural Architecture search (CaW-NAS), that combines the search for the architecture and its quantization policy. While former works search over a fully quantized search space, we define our search space with quantized and non-quantized architectures. Our search strategy finds the best trade-off between accuracy and latency according to the target hardware. Experimental results on a mobile platform show that, our method allows to obtain more efficient networks in terms of accuracy, execution time and energy consumption when compared to the state of the art.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125968592","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Event-Driven Programming of FPGA-accelerated ROS 2 Robotics Applications","authors":"Christian Lienen, M. Platzner","doi":"10.1109/DSD57027.2022.00088","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00088","url":null,"abstract":"Many applications from the robotics domain can benefit from FPGA acceleration. A corresponding key question is not only how to integrate hardware accelerators into software-centric robotics programming environments but also how to integrate more advanced approaches like dynamic partial reconfiguration. Recently, several approaches have demonstrated hardware acceleration for the robot operating system (ROS), the dominant programming environment in robotics. ROS is a middleware layer that features the composition of complex robotics applications as a set of nodes that communicate via mechanisms such as publish/subscribe, and distributes them over several compute platforms. In this paper, we present a novel approach for event-based programming of robotics applications that leverages dynamic partial reconfiguration and ReconROS, a framework for flexibly mapping ROS 2 nodes to either software or reconfigurable hardware. The approach bases on the ReconROS executor that schedules callbacks of ROS 2 nodes and utilizes a reconfigurable slot model and partial runtime reconfiguration to load hardware-based callbacks on demand. We describe the ReconROS executor approach, give design examples, and experimentally evaluate its functionality with examples.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124183501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nonlinear Compression Block Codes Search Strategy","authors":"O. Novák","doi":"10.1109/DSD57027.2022.00094","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00094","url":null,"abstract":"This paper deals with extending linear compression codes by nonlinear check bits that improve the usability of decompressed patterns for testing circuits with more inputs. The earlier works used a purely random or partially random search of the nonlinear check-bits truth tables to construct the first nonlinear structures. Here, we derive deterministic rules that characterize the relationship among the nonlinear code check bits. The efficiency of the rules is demonstrated on different codes with the number of specified bits equal to three. The code parameters obtained after applying the rules overperform the parameters of the linear codes. Keeping the restrictions makes the search for the check bit truth tables faster and more efficient than can be got by a simple random search. The reached nonlinear block code (136,5,3) is the most efficient code among other loose compression codes.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124378212","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inference Time Reduction of Deep Neural Networks on Embedded Devices: A Case Study","authors":"Isma-Ilou Sadou, Seyed Morteza Nabavinejad, Zhonghai Lu, Masoumeh Ebrahimi","doi":"10.1109/DSD57027.2022.00036","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00036","url":null,"abstract":"From object detection to semantic segmentation, deep learning has achieved many groundbreaking results in recent years. However, due to the increasing complexity, the execution of neural networks on embedded platforms is greatly hindered. This has motivated the development of several neural network minimisation techniques, amongst which pruning has gained a lot of focus. In this work, we perform a case study on a series of methods with the goal of finding a small model that could run fast on embedded devices. First, we suggest a simple, but effective, ranking criterion for filter pruning called Mean Weight. Then, we combine this new criterion with a threshold-aware layer-sensitive filter pruning method, called T-sensitive pruning, to gain high accuracy. Further, the pruning algorithm follows a structured filter pruning approach that removes all selected filters and their dependencies from the DNN model, leading to less computations, and thus low inference time in lower-end CPUs. To validate the effectiveness of the proposed method, we perform experiments on three different datasets (with 3, 101, and 1000 classes) and two different deep neural networks (i.e., SICK-Net and MobileNet V1). We have obtained speedups of up to 13x on lower-end CPUs (Armv8) with less than 1% drop in accuracy. This satisfies the goal of transferring deep neural networks to embedded hardware while attaining a good trade-off between inference time and accuracy.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"192 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131474762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"RISC-V Core with Approximate Multiplier for Error-Tolerant Applications","authors":"Anuj Verma, Priyamvada Sharma, B. P. Das","doi":"10.1109/DSD57027.2022.00040","DOIUrl":"https://doi.org/10.1109/DSD57027.2022.00040","url":null,"abstract":"RISC-V is an open-source instruction set architecture with customizable extensions to introduce operations like multiplication, division, atomic functions, and floating-point operations. In this paper, a new approximate multiplier is integrated with RI5CY (CV32E40P) processor, which can perform integer and floating-point multiplication for error-tolerant applications. The multiplication operation is required in various engineering and scientific applications, including image processing, digital signal processing, and many others. The proposed approximate multiplier is based on linear CORDIC (COordinate Rotation Digital Computer) algorithm and implemented by using only shift-add operations. It can perform multiplication and MAC (Multiply and accumulate) operations. The FPGA (Field programmable gate arrays) implementation results and ASIC (Application-specific integrated circuit) synthesis results for the proposed approximate multiplier along with RI5CY core are reported. The proposed design with RI5CY core is implemented on FPGA Xilinx Zedboard, which improves the performance by 20% and reduces power delay product (PDP) by 15.79% over the existing multipliers of the RI5CY core. Moreover, RI5CY core with the proposed approximate multiplier is synthesized using Industrial 130 nm standard cell library (ISCL) and Sub-threshold 130 nm standard cell library (STSCL) in Synopsys DC compiler. In the case of STSCL, RI5CY core with proposed approximate multiplier has 11.76% less power-consumption, 27.27% less delay, and 38.77% PDP compared to the existing multipliers of the RI5CY core.","PeriodicalId":211723,"journal":{"name":"2022 25th Euromicro Conference on Digital System Design (DSD)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131694057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}