{"title":"FPonAP: Implementation of Floating Point Operations on Associative Processors","authors":"Walaa Amer;Mariam Rakka;Fadi Kurdahi","doi":"10.1109/LES.2024.3446912","DOIUrl":"https://doi.org/10.1109/LES.2024.3446912","url":null,"abstract":"The associative processor (AP) is a processing in-memory (PIM) platform that avoids data movement between the memory and the processor by running computations directly in the memory. It is a parallel architecture based on content addressable memory (CAM), allowing it to address data by its content and thus accelerating search and pattern recognition tasks. APs are suggested as a promising solution to the memory wall caused by the data movement bottleneck in traditional von Neumann architectures for data-driven applications, such as machine learning. However, modern implementations of the AP still lack support for floating point (FP) operations that are heavily used in the target applications. In this letter, we present a novel implementation of FP operations on the AP and evaluate its performance in terms of latency and energy, showing that the proposed solution outperforms parallel FP execution on a central processing unit and even a GPU for large vector sizes.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"16 4","pages":"389-392"},"PeriodicalIF":1.7,"publicationDate":"2024-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142789081","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Novel Toolset for Efficient Hardwired Micro-Op Translation in Embedded Microarchitectures","authors":"Kevin J. Phillipson;Michael G. Rywalt;Baibhab Chatterjee;Eric M. Schwartz;Greg Stitt","doi":"10.1109/LES.2024.3447695","DOIUrl":"https://doi.org/10.1109/LES.2024.3447695","url":null,"abstract":"Modern SoCs require increasingly complex embedded control deep within their numerous sub-blocks without adding significant die area. This motivated the creation of μRTL, a novel toolset for systematically designing efficient pipelined implementations of embedded instruction sets originally intended for multicycle execution. μRTL utilizes hardwired micro-op translation, a technique commonly used in the instruction decoders of large super-scalar microprocessors; however, this technique has been overlooked for designing smaller, more efficient embedded microprocessors. Furthermore, the tools to develop instruction decoders with micro-op translation are proprietary and the techniques are trade secrets. The μRTL toolset is open-source and this letter clearly presents the methodology. The methodology emphasizes direct opcode decoding from multiple synthesized Verilog blocks versus traditional microprogramming, which uses sequential decoding from a ROM. Our results show that a pipelined μRTL microarchitecture achieves a 21.8% reduction in size compared to a hardwired multicycle implementation of the same instruction set. Additionally, the performance of 0.75 DMIPS/MHz surpasses the RISC-V PicoRV32 by 44.2% and the AVR RISC by 82.9%. These improvements in performance, power, and area are of interest to embedded system architects.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"16 4","pages":"373-376"},"PeriodicalIF":1.7,"publicationDate":"2024-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142789082","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hiding in Plain Sight: Reframing Hardware Trojan Benchmarking as a Hide&Seek Modification","authors":"Amin Sarihi;Ahmad Patooghy;Peter Jamieson;Abdel-Hameed A. Badawy","doi":"10.1109/LES.2024.3443155","DOIUrl":"https://doi.org/10.1109/LES.2024.3443155","url":null,"abstract":"This letter focuses on advancing security research in the hardware design space by formally defining the realistic problem of hardware Trojan (HT) detection. The goal is to model HT detection more closely to the real world, i.e., describing the problem as “The Seeker’s Dilemma” where a detecting agent is unaware of whether circuits are infected by HTs or not. Using this theoretical problem formulation, we create a benchmark that consists of a mixture of HT-free and HT-infected restructured circuits while preserving their original functionalities. The restructured circuits are randomly infected by HTs, causing a situation where the defender is uncertain if a circuit is infected or not. We believe that our innovative benchmark and methodology of creating benchmarks will help the community judge the detection quality of different methods by comparing their success rates in circuit classification. We use our developed benchmark to evaluate three state-of-the-art HT detection tools to show baseline results for this approach. We use principal component analysis to assess the strength of our benchmark, where we observe that some restructured HT-infected circuits are mapped closely to HT-free circuits, leading to significant label misclassification by detectors.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"16 4","pages":"361-364"},"PeriodicalIF":1.7,"publicationDate":"2024-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142789108","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ViTSen: Bridging Vision Transformers and Edge Computing With Advanced In/Near-Sensor Processing","authors":"Sepehr Tabrizchi;Brendan C. Reidy;Deniz Najafi;Shaahin Angizi;Ramtin Zand;Arman Roohi","doi":"10.1109/LES.2024.3449240","DOIUrl":"https://doi.org/10.1109/LES.2024.3449240","url":null,"abstract":"This letter introduces ViTSen, optimizing vision transformers (ViTs) for resource-constrained edge devices. It features an in-sensor image compression technique to reduce data conversion and transmission power costs effectively. Further, ViTSen incorporates a ReRAM array, allowing efficient near-sensor analog convolution. This integration, novel pixel reading, and peripheral circuitry decrease the reliance on analog buffers and converters, significantly lowering power consumption. To make ViTSen compatible, several established ViT algorithms have undergone quantization and channel reduction. Circuit-to-application co-simulation results show that ViTSen maintains accuracy comparable to a full-precision baseline across various data precisions, achieving an efficiency of ~3.1 TOp/s/W.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"16 4","pages":"341-344"},"PeriodicalIF":1.7,"publicationDate":"2024-12-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142788995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analysis of S-Box Hardware Resources to Improve AES Intrinsic Security Against Power Attacks","authors":"Thockchom Birjit Singha;Roy Paily Palathinkal;Shaik Rafi Ahamed","doi":"10.1109/LES.2024.3478070","DOIUrl":"https://doi.org/10.1109/LES.2024.3478070","url":null,"abstract":"Side-channel attacks (SCAs) have rendered Internet of Things (IoT)-based devices unsafe despite employing Advanced Encryption Standard (AES) as the cryptographic algorithm. Additional circuitry, called countermeasures, is used to protect AES against the attacks at the cost of huge area and power overheads. The attacks are performed on the SubBytes round operation of AES, which comprises 16 S-boxes. This letter makes a novel attempt to boost the intrinsic security of an unprotected AES by analyzing the four smallest composite field arithmetic (CFA)-based S-boxes available in the literature, namely Circuit Minimization Team (CMT), Canright, Maximov, and Masoleh, with a lookup table (LUT)-based S-box as a reference. This letter proposes an AES design which is unprotected but with enhanced security. The designer can aim for higher security by adding smaller countermeasure protective schemes before incorporating the design into IoT devices. A novel 3-D hardware analysis, namely, hardware resources, hardware complexity/linearity, and hardware security, is performed, which demonstrates that the lower gate equivalent (GE) count and fewer linear gates of the Masoleh S-box offer the highest security. Upon evaluation on the Side-Channel Attack Standard Evaluation Board (SASEBO), all the hardware security metrics favored the Masoleh S-box, showing a nearly 94× gain in security and an 80% reduction in area with respect to other unprotected designs.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"16 4","pages":"525-528"},"PeriodicalIF":1.7,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementation of Polyphase Digital Down Converter Using Optimized LMS Algorithm for WCDMA Application","authors":"Debarshi Datta;Mrinal Kanti Naskar","doi":"10.1109/LES.2024.3473539","DOIUrl":"https://doi.org/10.1109/LES.2024.3473539","url":null,"abstract":"This letter presents the implementation of a polyphase digital down converter (DDC) that employs a least mean square (LMS) algorithm associated with particle swarm optimization (PSO) for the wideband code division multiple access (WCDMA) application. The PSO-based LMS algorithm suppresses the noise signal, enabling a significant improvement in the spurious-free dynamic range (SFDR), which is 130 dB. The complex multiplication is realized by the canonical implementation to reduce the number of multipliers. The suggested polyphase DDC architecture is successfully implemented on the Kintex-7 field-programmable gate array (FPGA) platform. To achieve high accuracy, the proposed design is implemented with an efficient user-defined floating-point representation data type. Synthesis results suggest that the design consumes less area and power compared to the most recent structure.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"16 4","pages":"533-536"},"PeriodicalIF":1.7,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Using Intermittent Chaotic Clocks to Secure Cryptographic Chips","authors":"Abdollah Masoud Darya;Sohaib Majzoub;Ali A. El-Moursy;Mohamed Wed Eladham;Khalid Javeed;Ahmed S. Elwakil","doi":"10.1109/LES.2024.3472709","DOIUrl":"https://doi.org/10.1109/LES.2024.3472709","url":null,"abstract":"This letter proposes using intermittent chaotic clocks, generated from chaotic maps, to drive cryptographic chips running the advanced encryption standard as a countermeasure against correlation power analysis (CPA) attacks. Five different chaotic maps, namely, the logistic map, the Bernoulli shift map, the Henon map, the tent map, and the Ikeda map, are used in this letter to generate chaotic clocks. The performance of these chaotic clocks is evaluated in terms of timing overhead and the resilience of the driven chip against CPA attacks. All proposed chaotic clocking schemes successfully protect the driven chip against attacks, with the clocks produced by the optimized Ikeda, Henon, and logistic maps achieving the lowest timing overhead. These optimized maps, due to their intermittent chaotic behavior, exhibit lower timing overhead compared to previous work. Notably, the chaotic clock generated by the optimized Ikeda map approaches the theoretical limit of timing overhead, i.e., half the execution time of a reference periodic clock.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"16 4","pages":"529-532"},"PeriodicalIF":1.7,"publicationDate":"2024-10-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MQTT-Based Adaptive Estimation Over Distributed Network Using Raspberry Pi Pico W","authors":"Prantaneel Debnath;Anshul Gusain;Parth Sharma;Pyari Mohan Pradhan","doi":"10.1109/LES.2024.3473017","DOIUrl":"https://doi.org/10.1109/LES.2024.3473017","url":null,"abstract":"As the demand for edge computing applications continues to rise, the need for efficient training of resource-constrained devices becomes paramount. This letter proposes a message queuing telemetry transport (MQTT)-based implementation of distributed estimation strategies in the context of the Internet of Things (IoT), namely incremental, consensus, and diffusion strategies. The use of Raspberry Pi Pico W in the emulation environment is motivated by its advanced capability, while the MQTT data protocol is employed to address the constraints associated with conventional HTTP/HTTPS protocols. Synchronization in an IoT network is achieved by the integration of a novel methodology that entails the use of the wait-for-slowest (WFS) protocol and the MQTT protocol. Furthermore, the development of a graphical user interface supported by the Django application allows for adjusting parameters in distributed strategies through the HTTP REST API, along with SQLite. The results acquired from hardware experiments exhibit a strong correlation with the mean-square performance achieved in simulation studies. The distributed estimation strategy is compared with state-of-the-art centralized and noncooperation estimation strategies, demonstrating its superior performance. In addition, a study is conducted on the resilience of these IoT networks in the face of several network threats, such as node failure and model poisoning attacks. A theoretical analysis is provided to explain the relationship between the number of iterations and node failure.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"16 4","pages":"517-520"},"PeriodicalIF":1.7,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Provably Secure Scheme to Prevent Master Key Recovery by Fault Attack on AES Hardware","authors":"Sneha Swaroopa;Sivappriya Manivannan;Rajat Subhra Chakraborty;Indrajit Chakrabarti","doi":"10.1109/LES.2024.3472673","DOIUrl":"https://doi.org/10.1109/LES.2024.3472673","url":null,"abstract":"We explore a relatively lightweight scheme to prevent key recovery by fault attacks on the advanced encryption standard (AES) cipher. We employ a transformed key (derived from the original key through a nonlinear and possibly one-way mapping) for AES encryption hardware. The mapping combines processing using a pseudorandom bitstream generator (the keystream generator of the Grain-128a stream cipher), followed by a self-shrinking generator (SSG). We provide a formal proof of security of the scheme, based on the assumed difficulty of inverting the output of the proposed key transformer. The design of the key transformer ensures that it is itself resistant to fault attacks. Our scheme requires a 96-bit secret initial value (IV), a one-time initial latency (approximately 256 clock cycles for a 128-bit key) of generating the transformed key, and a key transformation layer. However, the core AES hardware is left unchanged. We present hardware platform-based experimental results for an FPGA implementation, which incurs less hardware overhead than previously proposed fault attack prevention/detection schemes.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"16 4","pages":"521-524"},"PeriodicalIF":1.7,"publicationDate":"2024-10-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Scalable Dynamic Segmented Bus Interconnect for Neuromorphic Architectures","authors":"Phu Khanh Huynh;Ilknur Mustafazade;Francky Catthoor;Nagarajan Kandasamy;Anup Das","doi":"10.1109/LES.2024.3452551","DOIUrl":"https://doi.org/10.1109/LES.2024.3452551","url":null,"abstract":"Large-scale neuromorphic architectures consist of computing tiles that communicate spikes using a shared interconnect. We propose ADIONA, a dynamic segmented bus interconnect to address design scalability while reducing energy and latency of spike traffic. ADIONA consists of parallel bus lanes arranged in a ladder-shaped structure that allows any tile to connect to another, offers multiple routing options for communication links, and provides a high level of customization for different mapping scenarios and use cases. Each lane in the ladder bus is partitioned into segments using lightweight bufferless switches. Based on compile-time communication information, these switches can be dynamically reconfigured at runtime to execute the target application. Our dynamic segmented bus interconnect substantially enhances hardware utilization, improves fault tolerance, and offers adaptability to execute different applications on a single hardware platform. We evaluate ADIONA using three synthetic and three realistic machine learning workloads on a cycle-accurate neuromorphic simulator. Our results show that ADIONA reduces energy consumption by 2.1×, latency by 40×, and interconnect area by 2×, compared to a state-of-the-art interconnect for neuromorphic systems.","PeriodicalId":56143,"journal":{"name":"IEEE Embedded Systems Letters","volume":"16 4","pages":"505-508"},"PeriodicalIF":1.7,"publicationDate":"2024-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142777790","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}