{"title":"Efficient Attacks on Strong PUFs via Covariance and Boolean Modeling","authors":"Hongfei Wang, Wei Liu, Wenjie Cai, Yunxiao Lu, Caixue Wan","doi":"10.1145/3687469","DOIUrl":"https://doi.org/10.1145/3687469","url":null,"abstract":"The physical unclonable function (PUF) is a widely used hardware security primitive. Before hacking into a PUF-protected system, intruders typically initiate attacks on the PUF as the first step. Many strong PUF designs have been proposed to thwart non-invasive attacks that exploit acquired challenge-response pairs (CRPs). In this work, we propose a general framework for efficient attacks on strong PUFs by investigating two perspectives, namely, statistical covariances in the challenge space and the design dependency among PUF compositions. The framework consists of two novel attack methods against a wide range of PUF families, including XOR APUFs, interpose PUFs, and bistable ring (BR) PUFs. It can also exploit reliability information to improve attack efficiency with gradient optimization. We evaluate our proposed attacks through extensive experiments, running both software-based simulations and hardware implementations on FPGAs to compare with corresponding state-of-the-art (SOTA) works. Considerable effort has been made to ensure identical software/hardware conditions for a fair comparison. The results demonstrate that our framework significantly outperforms SOTA results.
Moreover, we show that our framework can efficiently attack diverse PUF families built on entirely different design principles, while almost all existing works focused solely on attacking one or a very limited number of PUF designs.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-08-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141927072","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"PriorMSM: An Efficient Acceleration Architecture for Multi-Scalar Multiplication","authors":"Changxu Liu, Hao Zhou, Patrick Dai, Li Shang, Fan Yang","doi":"10.1145/3678006","DOIUrl":"https://doi.org/10.1145/3678006","url":null,"abstract":"Multi-Scalar Multiplication (MSM) is a computationally intensive task that operates on elliptic curves over GF(P). It is commonly used in Zero-Knowledge Proof (ZKP) systems, where it accounts for a significant portion of the computation time required for proof generation. In this paper, we present PriorMSM, an efficient acceleration architecture for MSM. We propose a Priority-Based Scheduling Mechanism (PBSM) built on a multi-FIFO, multi-bank architecture to accelerate MSM. By increasing the pairing success rate of internal points, the PBSM reduces the number of bubbles in the point addition (PADD) pipeline, consequently improving its data throughput. We also introduce an advanced parallel bucket aggregation algorithm that leverages PADD’s fully pipelined characteristics to significantly accelerate bucket aggregation. We perform a sensitivity analysis on a crucial MSM parameter, the window size; the results indicate that the window size significantly impacts latency. The Area-Time Product (ATP) metric is introduced to guide the selection of the optimal window size, balancing performance and cost for practical MSM implementations. PriorMSM is evaluated using the TSMC 28nm process.
It achieves a maximum speedup of 10.9× compared to previous custom hardware implementations and a maximum speedup of 3.9× compared to GPU implementations.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141652667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Stream Scheduling of Inference Pipelines on Edge Devices - a DRL Approach","authors":"Danny Pereira, Sumana Ghosh, Soumyajit Dey","doi":"10.1145/3677378","DOIUrl":"https://doi.org/10.1145/3677378","url":null,"abstract":"Low-power edge devices equipped with Graphics Processing Units (GPUs) are a popular target platform for real-time scheduling of inference pipelines. Such application-architecture combinations are popular in Advanced Driver-Assistance Systems (ADAS) for aiding the real-time decision-making of automotive controllers. However, the real-time throughput sustainable by such inference pipelines is limited by the resource constraints of the target edge devices. Modern GPUs, in both edge devices and workstation variants, support concurrent execution of computation kernels and data transfers using the primitive of streams, and also allow the assignment of priorities to these streams. This opens up the possibility of executing the computation layers of inference pipelines within a multi-priority, multi-stream environment on the GPU. However, manually co-scheduling such applications while satisfying their throughput requirements and the platform memory budget may require an unmanageable number of profiling runs. In this work, we propose a Deep Reinforcement Learning (DRL) based method for deciding the start time of the operations in each pipeline layer while optimizing the execution latency of the inference pipelines as well as memory consumption. Experimental results demonstrate the efficacy of the proposed DRL approach in comparison with baseline methods, particularly in terms of real-time performance, schedulability ratio, and memory savings.
We have additionally assessed the effectiveness of the proposed DRL approach using the real-time traffic simulation tool IPG CarMaker.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141658363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Power Optimization Approach for Large-scale RM-TB Dual Logic Circuits Based on an Adaptive Multi-Task Intelligent Algorithm","authors":"Xiaoqian Wu, Huaxiao Liu, Peng Wang, Lei Liu, Zhenxue He","doi":"10.1145/3677033","DOIUrl":"https://doi.org/10.1145/3677033","url":null,"abstract":"Logic synthesis is a crucial step in integrated circuit design, and power optimization is an indispensable part of this process. However, power optimization for large-scale Mixed Polarity Reed-Muller (MPRM) logic circuits is an NP-hard problem. In this paper, following a divide-and-conquer strategy, we divide Boolean circuits into small-scale circuits using the proposed Dynamic Adaptive Grouping Strategy (DAGS) and circuit decomposition model. Each small-scale Boolean circuit is transformed into an MPRM logic circuit by a polarity transformation algorithm. Through gate-level integration, we combine the small-scale circuits into an MPRM and Boolean Dual Logic (RBDL) circuit. The power optimization of RBDL circuits is a multi-task, multi-extremal, high-dimensional combinatorial optimization problem, for which we propose an Adaptive Multi-task Intelligent Algorithm (AMIA) that combines global task optimization, population reproduction, valuable knowledge transfer, and local exploration to search for the lowest-power RBDL circuit. Moreover, building on the proposed Fast Power Decomposition Algorithm (FPDA), we propose a Power Optimization Approach (POA) that uses the AMIA to obtain the lowest-power RBDL circuit.
Experimental results based on Microelectronics Center of North Carolina (MCNC) benchmark circuits demonstrate the effectiveness and superiority of the POA compared to state-of-the-art power optimization approaches.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141659979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MAB-BMC: A Formal Verification Enhancer by Harnessing Multiple BMC Engines Together","authors":"Devleena Ghosh, Sumana Ghosh, Ansuman Banerjee, R. Gajavelly, Sudhakar Surendran","doi":"10.1145/3675168","DOIUrl":"https://doi.org/10.1145/3675168","url":null,"abstract":"In recent times, Bounded Model Checking (BMC) engines have gained wide prominence in formal verification. Different BMC engines exist, differing in the optimizations, representations, and solving mechanisms they use to represent and navigate the underlying state transition system of the design to be verified. The objective of this paper is to examine whether combinations of BMC engines can harness their complementary strengths. We propose an approach that creates a sequence of BMC engines that reaches greater depths in formal verification than executing any single engine alone for the same time. Our approach uses machine learning, specifically the Multi-Armed Bandit paradigm of reinforcement learning, to predict the best-performing BMC engine for a given unrolling depth of the underlying circuit design. We evaluate our approach on a set of designs from the Hardware Model Checking Competition (HWMCC) benchmarks and show that it outperforms state-of-the-art BMC engines in terms of the depth reached or the time taken to deduce a property violation. The synthesized BMC engine sequences reach better depths than the HWMCC results and the state-of-the-art technique, super_deep, in more than 80% of the cases. They also outperform single-engine runs in more than 92% of the cases where a property violation is not found within a given time duration.
For designs where property violations are found within the given time duration, the synthesized sequences found the violation in less time than HWMCC for all the designs and outperformed both super_deep and single-engine runs for more than 87% of the designs.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":2.2,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141685856","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Single Bitline Highly Stable, Low Power With High Speed Half-Select Disturb Free 11T SRAM Cell","authors":"Lokesh Soni, Neeta Pandey","doi":"10.1145/3653675","DOIUrl":"https://doi.org/10.1145/3653675","url":null,"abstract":"<p>A half-select disturb-free 11T (HF11T) static random access memory (SRAM) cell with low power, improved stability, and high speed is presented in this paper. The proposed SRAM cell works well with bit-interleaving designs, which enhances soft-error immunity. The proposed HF11T cell is compared with other cutting-edge designs: a single-ended half-select-free 11T (SEHF11T), a shared-pass-gate 11T (SPG11T), a data-dependent stack PMOS switching 10T (DSPS10T), a single-ended half-selected robust 12T (HSR12T), and 11T SRAM cells. It exhibits 4.85×/9.19× lower read delay (<i>T<sub>RA</sub></i>) and write delay (<i>T<sub>WA</sub></i>), respectively, compared to the other considered SRAM cells. It achieves 1.07×/1.02× better read and write stability, respectively, than the considered SRAM cells. It shows maximum reductions of 1.68×/4.58×/94.72×/9×/145× in leakage power, read power, write power consumption, read power-delay product (PDP), and write PDP, respectively, compared to the considered SRAM cells. In addition, the proposed HF11T cell achieves a 10.14× higher <i>I<sub>on</sub></i>/<i>I<sub>off</sub></i> ratio than the other compared cells. These improvements come with a trade-off: 1.13× higher <i>T<sub>RA</sub></i> compared to the SPG11T.
Simulations are performed in Cadence Virtuoso using 45nm CMOS technology at a supply voltage (<i>V<sub>DD</sub></i>) of 0.6 V.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141505520","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Cost-Driven Chip Partitioning Method for Heterogeneous 3D Integration","authors":"Cheng-Hsien Lin, Kuan-Ting Chen, Yi-Yu Liu, Allen C.-H. Wu, TingTing Hwang","doi":"10.1145/3672558","DOIUrl":"https://doi.org/10.1145/3672558","url":null,"abstract":"3D ICs offer significant benefits in terms of performance and cost. Existing research on through-silicon via (TSV)-based 3D integrated circuit (IC) partitioning has focused on minimizing the number of TSVs to reduce costs. Partitioning methods based on heterogeneous integration have emerged as viable approaches for cost optimization, since leveraging mature processes to manufacture non-timing-critical blocks can yield cost benefits. Nevertheless, no previous 3D partitioning work has focused on reducing the overall cost, including both design and manufacturing costs, for heterogeneous 3D integration; moreover, throughput constraints have not been considered. This paper presents a cost-aware integer linear programming (ILP) formulation and a heuristic algorithm that partition the functional blocks of a design into different technology groups. Each group of functional blocks is implemented using a particular process technology and then integrated into a 3D IC.
Our results show that heterogeneous 3D integration can reduce the overall chip cost while satisfying various timing constraints.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-06-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141341376","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Correction of Arithmetic Circuits in the Presence of Multiple Bugs by Groebner Basis Modification","authors":"Negar Aghapour Sabbagh, B. Alizadeh","doi":"10.1145/3672559","DOIUrl":"https://doi.org/10.1145/3672559","url":null,"abstract":"One promising approach to verifying large arithmetic circuits is Symbolic Computer Algebra (SCA), where the circuit and the specification are translated into sets of polynomials and verification is performed by ideal membership testing. Here, the main problem is monomial explosion for buggy arithmetic circuits, which makes obtaining the word-level remainder infeasible, so the automatic correction of such circuits remains a significant challenge. Our proposed correction method partitions the circuit based on primary output bits and modifies the related Groebner basis based on the given suspicious gates, making it independent of the word-level remainder. We have applied our method to signed and unsigned multipliers of various sizes with varying numbers of suspicious and buggy gates. The results show that the proposed method corrects the bugs without area overhead. Moreover, it corrects buggy circuits on average 51.9× and 45.72× faster than state-of-the-art correction techniques for circuits with single and multiple bugs, respectively.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-06-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141351720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Estimating Power, Performance, and Area for On-Sensor Deployment of AR/VR Workloads Using an Analytical Framework","authors":"Xiaoyu Sun, Xiaochen Peng, Sai Zhang, J. Gómez, W. Khwa, Syed Sarwar, Ziyun Li, Weidong Cao, Zhao Wang, Chiao Liu, Meng-Fan Chang, B. Salvo, Kerem Akarvardar, H.-S. Philip Wong","doi":"10.1145/3670404","DOIUrl":"https://doi.org/10.1145/3670404","url":null,"abstract":"Augmented Reality and Virtual Reality have emerged as the next frontier of intelligent image sensors and computer systems. In these systems, 3D die stacking stands out as a compelling solution, enabling in-situ processing of sensory data for tasks such as image classification and object detection at low power, low latency, and in a small form factor. These intelligent 3D CMOS Image Sensor (CIS) systems present a wide design space, encompassing multiple domains (e.g., computer vision algorithms, circuit design, system architecture, and semiconductor technology, including 3D stacking) that have not been explored in depth so far. This paper aims to fill this gap. We first present an analytical evaluation framework, STAR-3DSim, dedicated to rapid pre-RTL evaluation of 3D-CIS systems, capturing the entire stack from the pixel layer to the on-sensor processor layer. With STAR-3DSim, we then propose several knobs for improving the power, performance, and area (PPA) of the Deep Neural Network (DNN) accelerator, providing up to 53%, 41%, and 63% reductions in energy, latency, and area, respectively, across a broad set of relevant AR/VR workloads.
Lastly, we present full-system evaluation results that take image sensing, cross-tier data transfer, and off-sensor communication into consideration.","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-06-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141373733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Advancing Hyperdimensional Computing Based on Trainable Encoding and Adaptive Training for Efficient and Accurate Learning","authors":"Jiseung Kim, Hyunsei Lee, Mohsen Imani, Yeseong Kim","doi":"10.1145/3665891","DOIUrl":"https://doi.org/10.1145/3665891","url":null,"abstract":"<p>Hyperdimensional computing (HDC) is a computing paradigm inspired by the mechanisms of human memory, characterizing data through high-dimensional vector representations, known as hypervectors. Recent advancements in HDC have explored its potential as a learning model, leveraging its straightforward arithmetic and high efficiency. The traditional HDC frameworks are hampered by two primary static elements: randomly generated encoders and fixed learning rates. These static components significantly limit model adaptability and accuracy. The static, randomly generated encoders, while ensuring high-dimensional representation, fail to adapt to evolving data relationships, thereby constraining the model’s ability to accurately capture and learn from complex patterns. Similarly, the fixed nature of the learning rate does not account for the varying needs of the training process over time, hindering efficient convergence and optimal performance. This paper introduces TrainableHD, a novel HDC framework that enables dynamic training of the randomly generated encoder depending on the feedback of the learning data, thereby addressing the static nature of conventional HDC encoders. TrainableHD also enhances the training performance by incorporating adaptive optimizer algorithms in learning the hypervectors. We further refine TrainableHD with effective quantization to enhance efficiency, allowing the execution of the inference phase in low-precision accelerators.
Our evaluations demonstrate that TrainableHD significantly improves HDC accuracy by up to 27.99% (averaging 7.02%) without additional computational costs during inference, achieving a performance level comparable to state-of-the-art deep learning models. Furthermore, TrainableHD is optimized for execution speed and energy efficiency. Compared to deep learning on a low-power GPU platform like NVIDIA Jetson Xavier, TrainableHD is 56.4 times faster and 73 times more energy efficient. This efficiency is further augmented through the use of Encoder Interval Training (EIT) and adaptive optimizer algorithms, enhancing the training process without compromising the model’s accuracy.</p>","PeriodicalId":50944,"journal":{"name":"ACM Transactions on Design Automation of Electronic Systems","volume":null,"pages":null},"PeriodicalIF":1.4,"publicationDate":"2024-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141253460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"Computer Science","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}