{"title":"TightLLM: Maximizing Throughput for LLM Inference via Adaptive Offloading Policy","authors":"Yitao Hu;Xiulong Liu;Guotao Yang;Linxuan Li;Kai Zeng;Zhixin Zhao;Sheng Chen;Laiping Zhao;Wenxin Li;Keqiu Li","doi":"10.1109/TC.2025.3558009","DOIUrl":"https://doi.org/10.1109/TC.2025.3558009","url":null,"abstract":"Large language models (LLMs) have demonstrated remarkable performance across a wide range of tasks, largely due to their substantial model size. However, this also results in significant GPU memory demands during inference. To address these challenges on hardware with limited GPU memory, existing approaches employ offloading techniques that offload unused tensors to CPU memory, thereby reducing GPU memory usage. Since offloading involves data transfer between GPU and CPU, it introduces transfer overhead. To mitigate this, prior works typically overlap data transfer with GPU computation using a fixed pipelining strategy applied uniformly across all inference iterations, referred to as <italic>static</i> offloading. However, static offloading policies fail to maximize inference throughput because they cannot adapt to the dynamically changing transfer overhead during the inference process, leading to increasing GPU idleness and reduced inference throughput. We propose that offloading policies should be <italic>adaptive</i> to the varying transfer overhead across inference iterations to maximize inference throughput. To this end, we design and implement an adaptive offloading-based inference system called TightLLM with two key innovations. First, its key-value (KV) distributor employs a <italic>trade-compute-for-transfer</i> strategy to address growing transfer overhead by dynamically recomputing portions of the KV cache, effectively overlapping data transfer with computation and minimizing GPU idleness. Second, TightLLM's weight loader slices model weights and distributes the loading process <italic>across multiple batches</i>, amortizing the excessive weight loading overhead and significantly improving throughput. Evaluation across various combinations of GPU hardware and LLM models shows that TightLLM achieves 1.3 to 23 times higher throughput during the decoding phase and 1.2 to 22 times higher throughput in the prefill phase compared to state-of-the-art offloading systems. Due to the higher throughput in prefill and decoding phases, TightLLM can reduce the completion time for large-scale tasks, which involve processing and generating a substantial number of tokens, by 59.6% to 94.9%.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2195-2209"},"PeriodicalIF":3.6,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FVM: Practical Feather-Weight Virtualization on Commodity Microcontrollers","authors":"Junchao Li;Runsheng Hou;Guangyong Shang;Huanle Zhang;Xiuzhen Cheng;Runyu Pan","doi":"10.1109/TC.2025.3558582","DOIUrl":"https://doi.org/10.1109/TC.2025.3558582","url":null,"abstract":"Recently, there has been an increasing drive to consolidate multiple microcontrollers into one physical entity, due to advantages in reducing overall costs, enhancing reliability, and simplifying hardware interconnections. To reduce consolidation engineering costs, minimizing system latency and memory footprint is important as well as maintaining compatibility with legacy software. In this paper, we propose a virtualization-based solution called Feather-weight Virtual Machine (<italic>FVM</i>) that focuses on these goals. <italic>FVM</i> enables low latency by specializing the virtualization model to Real-Time Operating Systems (RTOSes), achieves small footprint by adapting management policies to microcontroller memories, attains high compatibility by aligning with microcontroller ecosystem idiosyncrasies, finally allowing practical consolidation across a wide range of commodity microcontrollers. We implement and evaluate <italic>FVM</i> on ARMv6-M, ARMv7-M, and RISC-V architectures with two toolchains and two RTOSes, and it can fit into 20 KiB of RAM with less than 5% latency bloat.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2389-2401"},"PeriodicalIF":3.6,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BEAST-GNN: A United Bit Sparsity-Aware Accelerator for Graph Neural Networks","authors":"Yunzhen Luo;Yan Ding;Zhuo Tang;Keqin Li;Kenli Li;Chubo Liu","doi":"10.1109/TC.2025.3558587","DOIUrl":"https://doi.org/10.1109/TC.2025.3558587","url":null,"abstract":"Graph Neural Networks (GNNs) excel in processing graph-structured data, making them attractive and promising for tasks such as recommender systems and traffic forecasting. However, GNNs’ irregular computational patterns limit their ability to achieve low latency and high energy efficiency, particularly in edge computing environments. Current GNN accelerators predominantly focus on value sparsity, underutilizing the potential performance gains from bit-level sparsity. However, applying existing bit-serial accelerators to GNNs presents several challenges. These challenges arise from GNNs’ more complex data flow compared to conventional neural networks, as well as difficulties in data localization and load balancing with irregular graph data. To address these challenges, we propose BEAST-GNN, a bit-serial GNN accelerator that fully exploits bit-level sparsity. BEAST-GNN introduces streamlined sparse-dense bit matrix multiplication for optimized data flow, a column-overlapped graph partitioning method to enhance data locality by reducing memory access inefficiencies, and a sparse bit-counting strategy to ensure balanced workload distribution across processing elements (PEs). Compared to state-of-the-art accelerators, including HyGCN, GCNAX, Laconic, GROW, I-GCN, SGCN, and MEGA, BEAST-GNN achieves speedups of 21.7<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>, 6.4<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>, 10.5<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>, 3.7<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>, 4.0<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>, 3.3<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>, and 1.4<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula> respectively, while also reducing DRAM access by 36.3<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>, 7.9<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>, 6.6<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>, 3.9<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>, 5.38<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>, 3.37<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>, and 1.44<inline-formula><tex-math>$boldsymbol{times}$</tex-math></inline-formula>. Additionally, BEAST-GNN consumes only 4.8%, 12.4%, 19.6%, 27.7%, 17.0%, 26.5%, and 82.8% of the energy required by these architectures.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2402-2416"},"PeriodicalIF":3.6,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BE-NPU: A Bandwidth-Efficient Neural Processing Unit With Adaptive Processing Schemes for Reduced Off-Chip Bandwidth Demand","authors":"Yichuan Bai;Xiaopeng Zhang;Qian Wang;Yaqing Li;Yuan Du;Li Du","doi":"10.1109/TC.2025.3558579","DOIUrl":"https://doi.org/10.1109/TC.2025.3558579","url":null,"abstract":"Existing neural processing units (NPUs) mainly focus on the optimized multiply-accumulate (MAC) arrays for efficient inference of convolutional neural networks (CNNs). However, off-chip data transmission usually keeps NPUs waiting during CNN inference, causing up to 38.4GB/s off-chip bandwidth (OCB) demand for mobile AI devices. And none of the previous benchmarks quantitatively evaluate the bandwidth efficiency of different NPU architectures. In addition, CNNs exhibit distinct characteristics of off-chip data transmission when applied to different fields, and it has become a challenging task for NPUs to support different CNNs efficiently with reasonable OCB demand. To address the aforementioned issues, this paper proposes the Bandwidth-Peak Performance Ratio for n percentages of ideal frame rate (BPPR-n%) to demonstrate the normalized OCB demand of different NPU architectures. A bandwidth-efficient NPU (BE-NPU) is introduced with adaptive processing schemes to reduce the OCB demand during inference of different CNNs. The adaptive processing schemes include both instruction-level and thread-level schemes. For the instruction-level scheme, decoupled execute/access is introduced into depth-first (DF) and layer-first (LF) schemes to improve the concurrency between NPU calculation (CAL) and direct memory access (DMA) instructions. For the thread-level scheme, DF and LF threads are hybridly processed to further improve overall NPU efficiency. Compared with state-of-the-art works, BE-NPU achieves 48.1%∼80.6% reduction of BPPR-80% and 67.0%∼95.1% reduction of BPPR-95%. The proposed architecture is synthesized with TSMC 28nm technology node. BE-NPU utilizes 14.3% additional logic gates compared with baseline implementation.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2376-2388"},"PeriodicalIF":3.6,"publicationDate":"2025-04-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255615","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Quantum Secure Vector Dominance and Its Applications in Computational Geometry","authors":"Wenjie Liu;Bingmei Su;Feiyang Sun","doi":"10.1109/TC.2025.3557968","DOIUrl":"https://doi.org/10.1109/TC.2025.3557968","url":null,"abstract":"Secure vector dominance is a key cryptographic primitive in secure computational geometry (SCG), determining the dominance relationship of vectors between two participants without revealing their private information. However, the security of traditional SVD protocols is compromised by the formidable computational power of quantum computing, and their efficiency needs further improvement. To address these challenges, an efficient quantum secure vector dominance (QSVD) protocol is proposed. Specifically, we first introduce a quantum private permutation (QPP) subprotocol to shuffle the elements of each participant's private input vector. To further facilitate secure data comparison, we propose an enhanced quantum millionaire subprotocol with equality determination functionality, building upon Jia's original protocol. Based on the above two subprotocols, we propose a QSVD protocol with polynomial complexity, deriving vector dominance in a single interaction with a semi-honest third party. Performance analyses confirm that QSVD protocol is correct, resilient against malicious attacks, and retains polynomial computational complexity, ensuring both security and efficiency. To demonstrate the scalability of the QSVD protocol, we illustrate its applications in several geometric computation problems, such as point-line inclusion determination, line-line intersect determination, and point-in-polygon determination. Finally, we validate the feasibility of our protocol by conducting comprehensive simulations on IBM's Qiskit platform, demonstrating its practical applicability and effectiveness in real quantum computing environments.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 6","pages":"2129-2143"},"PeriodicalIF":3.6,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10949787","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143929750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HiCoCS: High Concurrency Cross-Sharding on Permissioned Blockchains","authors":"Lingxiao Yang;Xuewen Dong;Zhiguo Wan;Di Lu;Yushu Zhang;Yulong Shen","doi":"10.1109/TC.2025.3558001","DOIUrl":"https://doi.org/10.1109/TC.2025.3558001","url":null,"abstract":"As the foundation of the Web3 trust system, blockchain technology faces increasing demands for scalability. Sharding emerges as a promising solution, but it struggles to handle highly concurrent cross-shard transactions (<monospace>CSTx</monospace>s), primarily due to simultaneous ledger operations on the same account. Hyperledger Fabric, a permissioned blockchain, employs multi-version concurrency control for parallel processing. Existing solutions use channels and intermediaries to achieve cross-sharding in Hyperledger Fabric. However, the conflict problem caused by highly concurrent <monospace>CSTx</monospace>s has not been adequately resolved. To fill this gap, we propose HiCoCS, a high concurrency cross-shard scheme for permissioned blockchains. HiCoCS creates a unique virtual sub-broker for each <monospace>CSTx</monospace> by introducing a composite key structure, enabling conflict-free concurrent transaction processing while reducing resource overhead. The challenge lies in managing large numbers of composite keys and mitigating intermediary privacy risks. HiCoCS utilizes virtual sub-brokers to receive and process <monospace>CSTx</monospace>s concurrently while maintaining a transaction pool. Batch processing is employed to merge multiple <monospace>CSTx</monospace>s in the pool, improving efficiency. We explore composite key reuse to reduce the number of virtual sub-brokers and lower system overhead. Privacy preservation is enhanced using homomorphic encryption. Evaluations show that HiCoCS improves cross-shard transaction throughput by 3.5-20.2 times compared to the baselines.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2168-2182"},"PeriodicalIF":3.6,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Schedulability Analysis for Self-Suspending Tasks Under EDF-Like Scheduling","authors":"Yan Wang;Bo Lv;Quan Zhou;Junfei Li;Tan Tan","doi":"10.1109/TC.2025.3558079","DOIUrl":"https://doi.org/10.1109/TC.2025.3558079","url":null,"abstract":"Real-time systems involve tasks that may voluntarily suspend their execution as they await specific events or resources. Such self-suspension can introduce further delays and unpredictability in scheduling, making the analysis more challenging. Most current schedulability analysis methods of self-suspending tasks focus on fixed-priority scheduling or tasks with constrained deadlines. This paper proposes two schedulability analysis methods for self-suspending tasks with arbitrary deadlines under earliest-deadline-first-like (EDF-like) scheduling. Both methods are designed for preemptive uniprocessor systems. We first present a jitter-based response time analysis (JRTA) method. JRTA is designed based on a self-suspending response time analysis (SS-RTA) method under earliest-deadline-first (EDF) scheduling. We first convert self-suspensions to release jitters and then present a response time analysis (RTA) method of tasks with release jitters under EDF-like scheduling. To address the complexity of JRTA, we propose an improved schedulability analysis (ISA), a sufficiency blocking-based method. Finally, we provide many simulation experiments under some EDF-like scheduling algorithms. The results verify the effectiveness and efficiency of both proposed methods.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2364-2375"},"PeriodicalIF":3.6,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255760","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CINDA: Using Cache-Coherent Interconnects for Accelerating Databases by Enabling Near-Data Processing of Update Transactions","authors":"Sajjad Tamimi;Arthur Bernhardt;Florian Stock;Ilia Petrov;Andreas Koch","doi":"10.1109/TC.2025.3558028","DOIUrl":"https://doi.org/10.1109/TC.2025.3558028","url":null,"abstract":"Near-Data Processing (NDP) has been proven useful to accelerate Database Management Systems (DBMS) that handle infrequently accessed data stored in slow persistent storage. A key challenge for such an architecture is the synchronization of host-based and NDP operations, which require fine-grained interactions especially when the NDP device can also update (modify) the DBMS data autonomously. This paper introduces <monospace>CINDA</monospace>, the first full-stack computational storage capable of accelerating <i>both</i> read and update (write) database transactions using NDP. The proposed system relies on a hybrid host-device interface to enable the DBMS accessing persisted data, offloading computation to the storage device, and coordinating concurrent device-update operations with the host-update ones. A hybrid interface utilizes a cache-coherent interconnect such as CCIX or CXL for low-latency synchronization using a shared-lock table, and PCIe DMA for high-throughput bulk I/O. We evaluated the effectiveness of the proposed approach in a CCIX-based system by realizing an FPGA-based NDP-capable computational storage device and customizing an NDP-capable DBMS based on PostgreSQL to support update NDP operations. Our full-stack evaluation using the YCSB benchmark demonstrates that <monospace>CINDA</monospace> can deliver <inline-formula><tex-math>$approx$</tex-math></inline-formula>4.2<inline-formula><tex-math>$times$</tex-math></inline-formula> end-to-end speedup when executing long-running update transactions directly on the storage device, while the host DBMS performs frequent short updates.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2238-2252"},"PeriodicalIF":3.6,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255508","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Area Optimization Approach for Large-Scale RM-TB Dual Logic Circuits Based on a Multitasking Optimization Algorithm","authors":"Xiaoqian Wu;Peng Wang;Shaoquan Li;Huaxiao Liu;Lei Liu","doi":"10.1109/TC.2025.3558077","DOIUrl":"https://doi.org/10.1109/TC.2025.3558077","url":null,"abstract":"Logic synthesis is a crucial step in integrated circuit design, and area optimization is an indispensable part of this process. However, the area optimization problem for large-scale Fixed Polarity Reed-Muller (FPRM) circuits is an NP-hard problem. To address this problem, we divide Boolean circuits into small-scale circuits based on the idea of divide-and-conquer using the proposed grouping decomposition mechanism. Each small-scale Boolean circuit is transformed into an FPRM circuit by a polarity transformation algorithm. To ensure the circuit's functionality remains unaffected, we integrate FPRM circuits into an FPRM and Boolean (RM-TB) dual logic circuit based on the proposed gate-level integration. However, the area optimization problem of RM-TB dual logic circuits is a multi-task, high-dimensional, and multi-extremal combinatorial optimization problem. Therefore, we propose a Multipopulation Multitasking Optimization Algorithm (MMuOA) that integrates self-evolution with a multitasking equilibrium optimizer and cross-task evolution through knowledge sharing and transfer. This forms a dynamic optimization framework for simultaneously searching for the optimal polarity corresponding to the minimal area of RM-TB dual logic circuits. Moreover, we propose an Area Optimization Approach (AOA) for an RM-TB dual logic circuit with the minimum area using the MMuOA. Experimental results based on the Microelectronics Center of North Carolina (MCNC) Benchmark test circuits demonstrate the effectiveness and superiority of the AOA compared to the state-of-the-art area optimization approach.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2348-2363"},"PeriodicalIF":3.6,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144264217","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Differential Fault Attack on HE-Friendly Stream Ciphers: Masta, Pasta, and Elisabeth","authors":"Weizhe Wang;Deng Tang","doi":"10.1109/TC.2025.3558036","DOIUrl":"https://doi.org/10.1109/TC.2025.3558036","url":null,"abstract":"In this paper, we propose the Differential Fault Attack (DFA) on three Homomorphic Encryption (HE) friendly stream ciphers <monospace>Masta</monospace>, <monospace>Pasta</monospace>, and <monospace>Elisabeth</monospace>. Both <monospace>Masta</monospace> and <monospace>Pasta</monospace> are <monospace>Rasta</monospace>-like ciphers with publicly derived and pseudorandom affine layers. The design of <monospace>Elisabeth</monospace> is an extension of <monospace>FLIP</monospace> and <monospace>FiLIP</monospace>, following the group filter permutator paradigm. All these three ciphers operate on elements over <inline-formula><tex-math>$mathbb{Z}_{p}$</tex-math></inline-formula> or <inline-formula><tex-math>$mathbb{Z}_{2^{n}}$</tex-math></inline-formula>, rather than <inline-formula><tex-math>$mathbb{Z}_{2}$</tex-math></inline-formula>. We can recover the secret keys of all the targeted ciphers through DFA. In particular, for <monospace>Elisabeth</monospace>, we present a new method to determine the filtering path, which is vital to make the attack practical. Our attacks on various instances of <monospace>Masta</monospace> are practical and require only one block of keystream and a single word-based fault. By injecting three word-based faults, we can theoretically mount DFA on two instances of <monospace>Pasta</monospace>, <monospace>Pasta</monospace>-3 and <monospace>Pasta</monospace>-4. For <monospace>Elisabeth</monospace>-4, the only instance of the <monospace>Elisabeth</monospace> family, we present two DFAs in which we inject four bit-based faults or a single word-based fault. With 15000 normal and faulty keystream words, the DFA on <monospace>Elisabeth</monospace>-4 can be completed in just a few minutes.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 7","pages":"2267-2277"},"PeriodicalIF":3.6,"publicationDate":"2025-04-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144255510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}