IEEE Transactions on Computers — Latest Articles

A Framework for Carbon-Aware Real-Time Workload Management in Clouds Using Renewables-Driven Cores
IF 3.6 | CAS Q2, Computer Science
IEEE Transactions on Computers | Pub Date: 2025-03-20 | DOI: 10.1109/TC.2025.3571495 | Vol. 74, Issue 8, pp. 2757-2771
Authors: Tharindu B. Hewage; Shashikant Ilager; Maria A. Rodriguez; Rajkumar Buyya
Abstract: Cloud platforms commonly exploit workload temporal flexibility to reduce their carbon emissions: they suspend and resume workload execution for when and where the energy is greenest. However, increasingly prevalent delay-intolerant real-time workloads challenge this approach. To this end, we present a framework to harvest green renewable energy for real-time workloads in cloud systems. We use Renewables-driven cores in servers to dynamically switch CPU cores between real-time and low-power profiles, matching renewable energy availability. We then develop a VM Execution Model to guarantee that running VMs are allocated cores in the real-time power profile; if such cores are insufficient, we conduct criticality-aware VM evictions as needed. Furthermore, we develop a VM Packing Algorithm to utilize available cores across servers. We introduce the Green Cores concept in our algorithm to convert renewable energy usage into a server inventory attribute, and on this basis we jointly optimize renewable energy utilization and the reduction of VM eviction incidents. We implement a prototype of our framework in OpenStack as openstack-gc. Using an experimental openstack-gc cloud and a large-scale simulation testbed, we expose our framework to VMs running RTEval, a real-time evaluation program, and to a 14-day Azure VM arrival trace. Our results show: (i) a 6.52× reduction in the coefficient of variation of real-time latency over an existing solution based on workload temporal flexibility, and (ii) a joint 79.64% reduction in eviction incidents with a 34.83% increase in energy harvest over state-of-the-art packing algorithms. We open source openstack-gc at https://github.com/tharindu-b-hewage/openstack-gc.
Citations: 0
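The headline metric above, the coefficient of variation (CV) of real-time latency, is the standard deviation divided by the mean; a lower CV means more predictable latency. A minimal illustration of the metric (the latency values below are invented, not from the paper):

```python
import statistics

def coefficient_of_variation(samples):
    """CV = sample standard deviation / mean; a dimensionless jitter measure."""
    return statistics.stdev(samples) / statistics.mean(samples)

# Hypothetical per-request latencies (ms): a jittery trace vs. a stable one.
jittery = [1.0, 5.0, 2.0, 9.0, 1.5]
stable = [3.0, 3.2, 2.9, 3.1, 3.0]
print(coefficient_of_variation(jittery) > coefficient_of_variation(stable))
```

Because CV normalizes by the mean, it lets the paper compare latency stability across workloads whose absolute latencies differ.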
An FPGA-Based Open-Source Hardware-Software Framework for Side-Channel Security Research
IF 3.6 | CAS Q2, Computer Science
IEEE Transactions on Computers | Pub Date: 2025-03-17 | DOI: 10.1109/TC.2025.3551936 | Vol. 74, Issue 6, pp. 2087-2100
Authors: Davide Zoni; Andrea Galimberti; Davide Galli
Abstract: Attacks based on side-channel analysis (SCA) pose a severe security threat to modern computing platforms, further exacerbated on IoT devices by their pervasiveness and their handling of private and critical data. Designing SCA-resistant computing platforms requires significant additional effort in the early stages of the IoT device life cycle, which is severely constrained by strict time-to-market deadlines and tight budgets. This manuscript introduces a hardware-software framework for SCA research on FPGA targets. It delivers an IoT-class system-on-chip (SoC) that includes a RISC-V CPU, provides observability and controllability through an ad-hoc debug infrastructure to facilitate SCA attacks and evaluate the platform's security, and streamlines the deployment of SCA countermeasures through dedicated hardware and software features such as a DFS actuator and FreeRTOS support. The open-source release includes the SoC; the scripts to configure the computing platform, compile a target application, and assess SCA security; and a suite of state-of-the-art attacks and countermeasures. The goal is to foster its adoption and novel developments in the field, empowering designers and researchers to focus on studying SCA countermeasures and attacks while relying on a sound and stable hardware-software platform as the foundation for their research.
Citations: 0
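The abstract mentions assessing SCA security with a suite of attacks. A standard first-pass leakage check in this field is the TVLA-style fixed-vs-random Welch t-test; the sketch below runs it on synthetic traces (the 4.5 threshold is the conventional TVLA value, and the data and injected leak are illustrative, not tied to this framework):

```python
import numpy as np

def welch_t(fixed, random):
    """Per-sample-point Welch's t-statistic between two trace sets.
    In TVLA-style testing, |t| > 4.5 at any point flags likely leakage."""
    m1, m2 = fixed.mean(axis=0), random.mean(axis=0)
    v1, v2 = fixed.var(axis=0, ddof=1), random.var(axis=0, ddof=1)
    n1, n2 = fixed.shape[0], random.shape[0]
    return (m1 - m2) / np.sqrt(v1 / n1 + v2 / n2)

rng = np.random.default_rng(0)
noise = rng.normal(0.0, 1.0, (2000, 100))   # "random plaintext" traces
leaky = rng.normal(0.0, 1.0, (2000, 100))   # "fixed plaintext" traces
leaky[:, 40] += 0.5                         # inject a data-dependent difference
t = welch_t(leaky, noise)
print(abs(t[40]) > 4.5)                     # the injected point stands out
```

With 2000 traces per set, a 0.5-sigma mean difference yields a t-statistic around 15, well past the threshold, while unleaky points stay in the normal range.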
Deep Learning Operators Performance Tuning for Changeable Sized Input Data on Tensor Accelerate Hardware
IF 3.6 | CAS Q2, Computer Science
IEEE Transactions on Computers | Pub Date: 2025-03-17 | DOI: 10.1109/TC.2025.3551937 | Vol. 74, Issue 6, pp. 2101-2113
Authors: Pengyu Mu; Yi Liu; Rui Wang; Guoxiang Liu; Hangcheng An; Qianhe Zhao; Hailong Yang; Chenhao Xie; Zhongzhi Luan; Chunye Gong; Depei Qian
Abstract: The operator library is the fundamental infrastructure of deep learning acceleration hardware. Automatically generating the library and tuning its performance is promising because manual development by well-trained, skillful programmers is costly in both time and money. Tensor hardware offers the best computing efficiency for deep learning applications, but its operator library programs are hard to tune because tensor hardware primitives impose many limitations; without careful tuning, the hardware's performance cannot be fully exploited. Recent advances in LLMs exacerbate this problem because the size of the input data is not fixed. Therefore, mapping the computing tasks of operators to tensor hardware units is a significant challenge when the shape of the input tensor is unknown before runtime. We propose DSAT, a deep learning operator performance autotuning technique for changeable-sized input data on tensor hardware. To match the input tensor's undetermined shape, we choose a group of abstract computing units as the basic building blocks of operators for changeable-sized input tensor shapes. We design a group of programming tuning rules to construct a large exploration space of variant implementations of the operator programs. Based on these rules, we construct an intermediate representation of computing and memory access to describe the computing process and use it to map the abstract computing units to tensor primitives. To speed up the tuning process, we narrow down the optimization space by predicting the actual hardware resource requirement and providing an optimized cost model for performance prediction. DSAT achieves performance comparable to the vendor's manually tuned operator libraries. Compared to state-of-the-art deep learning compilers, it improves inference performance by 13% on average and decreases tuning time by an order of magnitude.
Citations: 0
Anole: A Pragmatic Blend of Classic and Learning-Based Algorithms in Congestion Control
IF 3.6 | CAS Q2, Computer Science
IEEE Transactions on Computers | Pub Date: 2025-03-15 | DOI: 10.1109/TC.2025.3566872 | Vol. 74, Issue 7, pp. 2501-2514
Authors: Feixue Han; Yike Wang; Yunbo Zhang; Qing Li; Dayi Zhao; Yong Jiang
Abstract: In recent years, hybrid congestion control (CC) algorithms that combine rule-based CC and learning-based CC have gained significant attention. They incorporate the fast adaptation of learning-based CC and the stability of rule-based CC, tending to select the better-performing rate based on network feedback. However, practical implementations of such algorithms reveal fundamental issues. Specifically, they require both CCs to run alternately, which lets a poorly performing CC continue to run in the network. Moreover, hybrid CCs cannot converge to the optimal rate when both CCs perform poorly. This paper proposes Anole to address these issues. Anole makes three main algorithmic contributions: 1) it always selects the better-performing CC; 2) it temporarily deprecates a consistently underperforming CC; and 3) when both CCs perform poorly, it infers the optimal sending rate from network feedback. We carry out comprehensive experiments in emulated and real-world wired networks, as well as in real-world WiFi networks, to assess Anole's performance. The results demonstrate that Anole achieves approximately 6% higher throughput in real-world links and 34% lower delay in a 48 Mbps link compared to the state-of-the-art CC. Anole also exhibits superior adaptability and fair convergence.
Citations: 0
Acceleration of Timing-Aware Gate-Level Logic Simulation Through One-Pass GPU Parallelism
IF 3.6 | CAS Q2, Computer Science
IEEE Transactions on Computers | Pub Date: 2025-03-13 | DOI: 10.1109/TC.2025.3569135 | Vol. 74, Issue 8, pp. 2675-2686
Authors: Weijie Fang; Yanggeng Fu; Jiaquan Gao; Longkun Guo; Gregory Gutin; Xiaoyan Zhang
Abstract: Given the growing scale and complexity of chip design, and the benefits of high-performance computing technologies, the simulation of Very Large Scale Integration (VLSI) circuits increasingly demands acceleration through parallel computing on GPU devices. However, conventional parallel strategies fail to fully leverage modern GPU capabilities, introducing new challenges in GPU-based parallelism for VLSI simulation despite previous demonstrations of significant acceleration. In this paper, we propose a novel approach for accelerating the simulation of 4-value-logic, timing-aware gate-level circuits through waveform-based GPU parallelism. Our approach introduces a strategy that effectively manages task dependencies while parallelizing combinational circuits, significantly reducing the synchronization required between CPU and GPU. The proposed approach achieves one-pass parallelism, requiring only a single round of data transfer. Moreover, to address the implementation challenges our strategy poses on GPU devices, we develop and optimize a series of data structures that dynamically allocate and store newly generated outputs of uncertain scale. Finally, we conduct experiments on industrial-scale open-source benchmarks to demonstrate our approach's performance gains over several state-of-the-art baselines.
Citations: 0
In-Situ NAS: A Plug-and-Search Neural Architecture Search Framework Across Hardware Platforms
IF 3.8 | CAS Q2, Computer Science
IEEE Transactions on Computers | Pub Date: 2025-03-13 | DOI: 10.1109/TC.2025.3569161 | Vol. 74, Issue 9, pp. 2856-2869
Authors: Hao Lv; Lei Zhang; Ying Wang
Abstract: Hardware-aware Neural Architecture Search (HW-NAS) has garnered significant research interest due to its ability to automate the design of neural networks for various hardware platforms. Prevalent HW-NAS frameworks often use fast predictors to estimate network performance, bypassing the time-consuming step of actual profiling. However, the resource-intensive construction of these predictors and their accuracy limitations hinder their practical use in diverse deployment scenarios. In response, we emphasize the indispensable role of actual profiling in HW-NAS and explore efficiency optimizations within the HW-NAS workflow. We provide a systematic analysis of profiling overhead in HW-NAS and identify many redundant and unnecessary operations during the search phase. We then optimize the workflow and present In-situ NAS, which leverages similarity features and exploration history to eliminate redundancy and improve runtime efficiency. In-situ NAS also offers simplified interfaces that ease the user's effort in managing the complex device-dependent profiling flow, enabling plug-and-search functionality across diverse hardware platforms. Experimental results show that In-situ NAS achieves an average 10× speedup across different hardware platforms while reducing search overhead by 8× compared to predictor-based approaches in various deployment scenarios. Additionally, In-situ NAS consistently discovers networks with better accuracy (about 1.5%) across diverse hardware platforms than predictor-based NAS.
Citations: 0
Cacomp: A Cloud-Assisted Collaborative Deep Learning Compiler Framework for DNN Tasks on Edge
IF 3.6 | CAS Q2, Computer Science
IEEE Transactions on Computers | Pub Date: 2025-03-12 | DOI: 10.1109/TC.2025.3569132 | Vol. 74, Issue 8, pp. 2663-2674
Authors: Weiwei Lin; Jinhui Lin; Haotong Zhang; Wentai Wu; Weizheng Wu; Zhetao Li; Keqin Li
Abstract: With the development of edge computing, DNN services have been widely deployed on edge devices. The deployment efficiency of deep learning models relies on the optimization of inference and on the scheduling policy. However, traditional optimization methods on edge devices still suffer from prohibitively long tuning times due to the devices' low computational power. Meanwhile, the widely used dominant resource fairness (DRF) scheduling algorithm struggles to maximize the efficiency of model execution on edge devices and inevitably increases average waiting time, as it is not applicable in real-time distributed computing environments. In this paper, we propose Cacomp, a distributed cloud-assisted deep learning compiler framework that accelerates optimization on edge devices with assistance from the cloud and introduces a novel inference task scheduling algorithm. Our framework utilizes tuning records from cloud devices and proposes a two-step distillation strategy to obtain the best tuning record set for each edge device. For scheduling, we propose an RD-DRF algorithm that allocates inference tasks to edge devices based on dominant resource matching in real time. Extensive results show that our framework achieves up to a 2.19× improvement in optimization time compared with other methods on edge devices. Our proposed scheduling algorithm significantly shortens the average waiting time of inference tasks by 30% and improves resource utilization by 20% on edge devices.
Citations: 0
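For context on what the abstract critiques: classic dominant resource fairness (DRF, Ghodsi et al.) repeatedly grants the next task to the user with the smallest dominant share, i.e. their largest per-resource fraction of cluster capacity. A minimal sketch of classic DRF, not the paper's RD-DRF variant (the cluster capacity and task demands are the standard textbook example, not from the paper):

```python
def dominant_share(used, capacity):
    # A user's dominant share is their largest per-resource fraction.
    return max(u / c for u, c in zip(used, capacity))

def drf_allocate(demands, capacity, rounds):
    """Classic DRF: each round, serve one task for the user
    with the lowest dominant share (ties break by dict order)."""
    used = {user: [0.0] * len(capacity) for user in demands}
    for _ in range(rounds):
        user = min(used, key=lambda u: dominant_share(used[u], capacity))
        used[user] = [a + d for a, d in zip(used[user], demands[user])]
    return used

# Two users on a <9 CPU, 18 GB> cluster: A's tasks need <1 CPU, 4 GB>,
# B's need <3 CPU, 1 GB>.
capacity = (9, 18)
demands = {"A": (1, 4), "B": (3, 1)}
alloc = drf_allocate(demands, capacity, rounds=5)
```

After five rounds both users converge to equal dominant shares (2/3): A runs three tasks (memory-dominant), B runs two (CPU-dominant). The paper's point is that this per-round fairness logic ignores real-time arrival dynamics on edge devices.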
Enabling Consistent Sensing Data Sharing Among IoT Edge Servers via Lightweight Consensus
IF 3.6 | CAS Q2, Computer Science
IEEE Transactions on Computers | Pub Date: 2025-03-11 | DOI: 10.1109/TC.2025.3549616 | Vol. 74, Issue 6, pp. 2045-2057
Authors: Xiulong Liu; Zhiyuan Zheng; Hao Xu; Zhelin Liang; Gaowei Shi; Chenyu Zhang; Keqiu Li
Abstract: Blockchain offers distinct advantages in data credibility and provenance certification, and its fusion with Internet of Things (IoT) technology holds great promise. Nevertheless, IoT environments are marked by extensive node networks and intricate communication patterns, especially in sensing settings. Conventional blockchain consensus mechanisms, hampered by their heavy reliance on computing resources and communication bandwidth, struggle to ensure seamless data exchange among IoT edge servers. State-of-the-art Byzantine Fault Tolerance (BFT) consensus suffers from: (i) high communication complexity between nodes; and (ii) the detrimental impact of Byzantine behavior on system performance. To overcome these problems, we propose a lightweight blockchain consensus called AntB, which introduces sampling into consensus for the first time and significantly reduces the number of participating consensus nodes from N to n, lowering the consensus complexity to 2·O(n) + O(N). We design a dynamic reputation mechanism so that Byzantine nodes cannot control the sampling set to degrade consensus liveness over the long term. In implementing AntB, we address three significant technical challenges: (i) to determine the optimal sample size, we propose a sampling calculation method based on statistical confidence intervals, where the sample size is primarily determined by the chosen confidence level and margin of error; (ii) to deter Byzantine behavior, we devise a weighted random sampling mechanism with reputation coefficients derived from edge servers' behavior; and (iii) to maintain consensus liveness and consistency after sampling, we propose a consensus mechanism of partial sampling with global verification to avert potential issues. We implement AntB and evaluate it on a server with 32 cores and 64 GB of memory. The evaluation indicates that the more nodes participate in consensus, the better AntB performs. In particular, compared to HotStuff, AntB achieves a 24.94% higher success rate and improves transactions per second (TPS) by 102.10% with 300 nodes.
Citations: 0
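The first challenge above, picking a sample size n from a confidence level and margin of error, is conventionally done with Cochran's proportion formula plus a finite-population correction for the N-node network. A sketch under that assumption (the 95% confidence / 5% margin parameters are illustrative, not AntB's actual settings):

```python
import math

def sample_size(population, z=1.96, p=0.5, margin=0.05):
    """Cochran's sample-size formula with finite-population correction.
    z: z-score for the confidence level (1.96 ~ 95%);
    p: assumed proportion (0.5 maximizes the required size);
    margin: acceptable margin of error."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2   # infinite-population size
    n = n0 / (1 + (n0 - 1) / population)        # correct for finite N
    return math.ceil(n)

print(sample_size(300))  # nodes to sample out of a 300-node network
```

For N = 300 this yields n = 169, which matches the paper's motivation: well over half the nodes can be excluded from each consensus round while preserving statistical representativeness.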
AR-Light: Enabling Fast and Lightweight Multi-User Augmented Reality via Semantic Segmentation and Collaborative View Synchronization
IF 3.6 | CAS Q2, Computer Science
IEEE Transactions on Computers | Pub Date: 2025-03-10 | DOI: 10.1109/TC.2025.3549629 | Vol. 74, Issue 6, pp. 2073-2086
Authors: Yu Wen; Aamir Bader Shah; Ruizhi Cao; Chen Zhang; Jiefu Chen; Xuqing Wu; Chenhao Xie; Xin Fu
Abstract: Multi-user Augmented Reality (MuAR) allows multiple users to interact with shared virtual objects by exchanging environment information. Current MuAR systems rely on 3D point clouds for real-world analysis, view synchronization, object rendering, and movement tracking. However, the complexity of 3D point clouds leads to significant processing delays, accounting for approximately 80% of the overhead in commercial frameworks; this hampers usability and degrades user experience. Our analysis reveals that, in a stable environment, maintaining the facing side of the real-world scene provides sufficient information for virtual object placement and rendering. To exploit this, we introduce a lightweight quadtree structure that represents 2D scenes through semantic segmentation and geometry, as an alternative to 3D point clouds. Additionally, we propose a novel correction method to handle potential shifts in virtual object placement during view synchronization among users. Combining these designs, we implement a fast and lightweight MuAR framework named AR-Light and test it on commercial AR devices. Evaluation on real-world applications demonstrates that AR-Light achieves high performance in various real-world scenes while maintaining comparable virtual object placement accuracy.
Citations: 0
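The abstract does not detail AR-Light's quadtree encoding; as background, the sketch below is a generic point-region quadtree (capacity-based subdivision) that illustrates why such a 2D structure is far lighter than a 3D point cloud, not the paper's actual design:

```python
class QuadTree:
    """Point-region quadtree: a node splits into four equal quadrants
    once it holds more than `capacity` points."""
    def __init__(self, x, y, size, capacity=4):
        self.x, self.y, self.size, self.capacity = x, y, size, capacity
        self.points, self.children = [], None

    def insert(self, px, py):
        if not (self.x <= px < self.x + self.size and
                self.y <= py < self.y + self.size):
            return False              # point lies outside this node
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((px, py))
                return True
            self._split()
        return any(c.insert(px, py) for c in self.children)

    def _split(self):
        h = self.size / 2
        self.children = [QuadTree(self.x + dx, self.y + dy, h, self.capacity)
                         for dx in (0, h) for dy in (0, h)]
        for p in self.points:         # push existing points down one level
            any(c.insert(*p) for c in self.children)
        self.points = []

    def count(self):
        if self.children is None:
            return len(self.points)
        return sum(c.count() for c in self.children)
```

Subdivision only happens where the scene has detail, so sparse regions cost a single node, which is the storage behavior a 2D scene representation needs to beat a dense 3D point cloud.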
DESA: Dataflow Efficient Systolic Array for Acceleration of Transformers
IF 3.6 | CAS Q2, Computer Science
IEEE Transactions on Computers | Pub Date: 2025-03-10 | DOI: 10.1109/TC.2025.3549621 | Vol. 74, Issue 6, pp. 2058-2072
Authors: Zhican Wang; Hongxiang Fan; Guanghui He
Abstract: Transformers have become prevalent in various Artificial Intelligence (AI) applications, spanning natural language processing to computer vision. Owing to their suboptimal performance on general-purpose platforms, various domain-specific accelerators that explore and exploit model sparsity have been developed. Instead, we conduct a quantitative analysis of Transformers (of the three types — encoder-only, decoder-only, and encoder-decoder — this paper focuses on encoder-only Transformers) to identify key inefficiencies and adopt dataflow optimization to address them. These inefficiencies arise from 1) diverse matrix multiplications, 2) multi-phase non-linear operations and their dependencies, and 3) heavy memory requirements. We introduce a novel dataflow design that supports decoupling with latency hiding, effectively reducing dependencies and addressing the performance bottlenecks of non-linear operations. To enable fully fused attention computation, we propose practical tiling and mapping strategies that sustain high throughput and notably decrease memory requirements from O(N²H) to O(N). A hybrid buffer-level reuse strategy is also introduced to enhance utilization and reduce off-chip access. Based on these optimizations, we propose a novel systolic array design, named DESA, with three innovations: 1) a reconfigurable vector processing unit (VPU) and immediate processing units (IPUs) that can be seamlessly fused within the systolic array to support various normalization, post-processing, and transposition operations with efficient latency hiding; 2) a hybrid-stationary systolic array that improves compute and memory efficiency for matrix multiplications with diverse operational intensities and characteristics; and 3) a novel tile-fusion processing scheme that efficiently addresses the low utilization of conventional systolic arrays during data setup and offloading. Across various benchmarks, extensive experiments demonstrate that DESA achieves 5.0×–8.3× energy savings over a 3090 GPU and 25.6×–88.4× over an Intel 6226R CPU. Compared to state-of-the-art designs, DESA achieves 11.6×–15.0× speedup and up to 2.3× energy savings over the SOTA accelerators.
Citations: 0
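The O(N²H) → O(N) memory claim comes from fully fusing attention: each slab of the score matrix is consumed as soon as it is produced, with a running (online) softmax, so the full N×N matrix is never materialized. A NumPy sketch of that standard online-softmax tiling — the numerical technique only, not DESA's systolic-array dataflow:

```python
import numpy as np

def tiled_attention(Q, K, V, tile=16):
    """Attention with online softmax: only an (N, tile) score slab is
    live at any time, so extra memory is O(N) instead of O(N^2)."""
    N, d = Q.shape
    out = np.zeros_like(V, dtype=float)  # running unnormalized output
    m = np.full(N, -np.inf)              # running row-wise score maximum
    l = np.zeros(N)                      # running softmax denominator
    for j in range(0, N, tile):
        S = Q @ K[j:j + tile].T / np.sqrt(d)      # one score slab
        m_new = np.maximum(m, S.max(axis=1))
        scale = np.exp(m - m_new)                 # rescale old accumulators
        P = np.exp(S - m_new[:, None])
        l = l * scale + P.sum(axis=1)
        out = out * scale[:, None] + P @ V[j:j + tile]
        m = m_new
    return out / l[:, None]
```

The result is bit-for-bit equivalent (up to floating-point rounding) to materializing the full score matrix and applying row-wise softmax, which is what makes fused-attention hardware like the design above numerically safe.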