"Joint Adaptive Aggregation and Resource Allocation for Hierarchical Federated Learning Systems Based on Edge-Cloud Collaboration"
Yi Su; Wenhao Fan; Qingcheng Meng; Penghui Chen; Yuan'an Liu
IEEE Transactions on Cloud Computing, vol. 13, no. 1, pp. 369-382, 2025-01-15. DOI: 10.1109/TCC.2025.3530681

Abstract: Hierarchical federated learning shows excellent potential for communication-computation trade-offs and reliable data privacy protection by introducing edge-cloud collaboration. Considering non-independent and identically distributed (non-IID) data among devices and edges, this article aims to minimize the final loss function under time and energy budget constraints by jointly optimizing the aggregation frequency and resource allocation. Although there is no closed-form expression relating the final loss function to the optimization variables, we divide the hierarchical federated learning process into multiple cloud intervals and analyze the convergence bound for each cloud interval. We then transform the initial problem into one that can be adaptively optimized in each cloud interval. We propose an adaptive hierarchical federated learning process, termed AHFLP, which determines the edge and cloud aggregation frequencies for each cloud interval based on estimated parameters; the CPU frequency of devices and the wireless channel bandwidth allocation are then optimized within each edge. Simulations are conducted under different models, datasets and data distributions, and the results demonstrate the superiority of AHFLP over existing schemes.
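The two-level aggregation loop that AHFLP tunes can be sketched as follows. This is a minimal illustration under assumptions: the function names are ours, aggregation is taken to be sample-weighted averaging, and `tau_e`/`tau_c` are held fixed here, whereas AHFLP adapts them per cloud interval from estimated parameters.

```python
def weighted_average(models, weights):
    """Element-wise average of parameter vectors, weighted by sample counts."""
    total = float(sum(weights))
    return [sum(w * m[i] for m, w in zip(models, weights)) / total
            for i in range(len(models[0]))]

def run_cloud_interval(devices, edges, tau_e, tau_c, local_update):
    """One cloud interval: tau_c edge-aggregation rounds, each preceded by
    tau_e local update steps on every device.
    devices: id -> (model, n_samples); edges: edge id -> list of device ids."""
    for _ in range(tau_c):
        for _ in range(tau_e):                      # local training steps
            for d in devices:
                model, n = devices[d]
                devices[d] = (local_update(model), n)
        for members in edges.values():              # edge aggregation
            agg = weighted_average([devices[d][0] for d in members],
                                   [devices[d][1] for d in members])
            for d in members:
                devices[d] = (agg, devices[d][1])
    # cloud aggregation across all devices closes the interval
    return weighted_average([m for m, _ in devices.values()],
                            [n for _, n in devices.values()])
```

The trade-off the paper optimizes lives in `tau_e` and `tau_c`: more local steps save communication but let non-IID devices drift apart.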
"Energy-Aware Offloading of Containerized Tasks in Cloud Native V2X Networks"
Estela Carmona-Cejudo; Francesco Iadanza
IEEE Transactions on Cloud Computing, vol. 13, no. 1, pp. 336-350, 2025-01-14. DOI: 10.1109/TCC.2025.3529245

Abstract: In cloud-native environments, executing vehicle-to-everything (V2X) tasks in edge nodes close to users significantly reduces service end-to-end latency. Containerization further reduces resource and time consumption and, subsequently, application latency. Since edge nodes are typically resource- and energy-constrained, optimizing offloading decisions and managing edge energy consumption is crucial. However, the offloading of containerized tasks has not been thoroughly explored from a practical implementation perspective. This paper proposes an optimization framework for energy-aware offloading of V2X tasks implemented as Kubernetes pods. A weighted utility function is derived based on cumulative pod response time, and an edge-to-cloud offloading decision algorithm (ECODA) is proposed. The system's energy cost model is derived, and a closed-loop repeated reward-based mechanism for CPU adjustment is presented. An energy-aware (EA)-ECODA is proposed to solve the offloading optimization problem while adjusting CPU usage according to energy considerations. Simulations show that ECODA and EA-ECODA outperform first-in, first-served (FIFS) and EA-FIFS in terms of utility, average pod response time, and resource usage, with low computational complexity. Additionally, a real testbed evaluation of a vulnerable road user application demonstrates that ECODA outperforms Kubernetes vertical scaling in terms of service-level delay. Moreover, EA-ECODA significantly improves energy usage utility.
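The weighted-utility flavor of such an edge-to-cloud decision can be illustrated with a toy rule. Everything here is hypothetical (names, the linear score, the greedy order); ECODA's actual utility is derived from cumulative pod response time, which this sketch does not model.

```python
def offload_decisions(pods, edge_cpu_budget, w_latency=1.0, w_energy=0.5):
    """Greedy toy rule: keep pods on the edge while the CPU budget lasts,
    preferring those that gain most from edge execution; offload the rest
    to the cloud. Each pod is (name, cpu_demand, latency_gain, energy_cost)."""
    score = lambda p: w_latency * p[2] - w_energy * p[3]
    keep, offload, used = [], [], 0.0
    for pod in sorted(pods, key=score, reverse=True):
        if used + pod[1] <= edge_cpu_budget and score(pod) > 0:
            keep.append(pod[0])
            used += pod[1]
        else:
            offload.append(pod[0])
    return keep, offload
```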
"Hybrid Serverless Platform for Smart Deployment of Service Function Chains"
Sheshadri K R; J. Lakshmi
IEEE Transactions on Cloud Computing, vol. 13, no. 1, pp. 351-368, 2025-01-14. DOI: 10.1109/TCC.2025.3528573

Abstract: Cloud data centres deal with dynamic changes all the time; networks, in particular, need to adapt their configurations to changing workloads. Network Function Virtualization (NFV) using Software-Defined Networks (SDNs) has brought programmability to networks. NFV allows network services to be programmed as software entities that can be deployed on commodity clusters in the cloud. Being software, they can be customized to specific tenants' requirements and thus support multi-tenant variations with ease. However, scaling in alignment with changing demands with minimal loss of service, while improving resource-usage efficiency, remains a challenge. Several recent works have proposed platforms that realize Virtual Network Functions (VNFs) on the cloud using service offerings such as Infrastructure as a Service (IaaS) and serverless computing. These approaches are limited by deployment difficulties (configuration and sizing), adaptability to performance requirements (elastic scaling), and changing workload dynamics (scaling and customization). In the current work, we propose a Hybrid Serverless Platform (HSP) to address these lacunae. The HSP is implemented using a combination of persistent IaaS and FaaS components: the IaaS components handle the steady-state load, whereas the FaaS components activate during the dynamic changes associated with scaling, minimizing service loss. The HSP controller takes provisioning decisions based on Quality of Service (QoS) rules and flow statistics using an auto-recommender, relieving users of sizing decisions for function deployment. The controller design exploits data locality in Service Function Chain (SFC) realization, reducing data-transfer times between VNFs, and uses application characteristics to offer finer control over SFC deployment. A proof-of-concept realization of HSP is presented and evaluated on a representative SFC under a dynamic workload, showing minimal loss in flowlet service, up to 35% resource savings compared to a pure IaaS deployment, and up to 55% lower end-to-end times compared to a baseline FaaS implementation.
"CARL: Cost-Optimized Online Container Placement on VMs Using Adversarial Reinforcement Learning"
Prathamesh Saraf Vinayak; Saswat Subhajyoti Mallick; Lakshmi Jagarlamudi; Anirban Chakraborty; Yogesh Simmhan
IEEE Transactions on Cloud Computing, vol. 13, no. 1, pp. 321-335, 2025-01-13. DOI: 10.1109/TCC.2025.3528446

Abstract: Containerization has become popular for deploying applications on public clouds. Large enterprises may host 100s of applications on 1,000s of containers that are placed onto Virtual Machines (VMs). Such placement decisions happen continuously as applications are updated by DevOps pipelines that deploy the containers. Managing the placement of container resource requests onto the available capacities of VMs needs to be cost-efficient. This is well studied, and usually modelled as a multi-dimensional Vector Bin-Packing Problem (VBP). Many heuristics, and recently machine-learning approaches, have been developed to solve this NP-hard problem for real-time decisions. We propose CARL, a novel approach that solves VBP through adversarial Reinforcement Learning (RL) for cost minimization. It mimics the placement behavior of an offline semi-optimal VBP solver (teacher), while automatically learning a reward function for reducing VM costs that outperforms the teacher. It requires limited historical container workload traces to train, and is resilient to changes in the workload distribution during inference. We extensively evaluate CARL on workloads derived from realistic Google and Alibaba traces for the placement of 5k-10k container requests onto 2k-8k VMs, and compare it with classic heuristics and state-of-the-art RL methods. (1) CARL is fast, e.g., making placement decisions at approximately 1,900 requests/sec onto 8,900 candidate VMs. (2) It is efficient, achieving approximately 16% lower VM costs than classic and contemporary RL methods. (3) It is robust to changes in the workload, offering competitive results even when the resource needs or inter-arrival times of the container requests skew from the training workload.
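CARL's placement task is the classic vector bin-packing setting. A first-fit baseline of the kind such work is compared against can be sketched as follows (illustrative only, not the paper's RL policy; names are ours):

```python
def first_fit(requests, capacities):
    """First-fit for multi-dimensional vector bin packing: place each
    container request on the first VM whose residual capacity covers every
    resource dimension. Returns request index -> VM index (None if no fit)."""
    residual = [list(c) for c in capacities]   # mutable copy of VM capacities
    placement = {}
    for i, req in enumerate(requests):
        placement[i] = None
        for v, free in enumerate(residual):
            if all(f >= q for f, q in zip(free, req)):
                for d, q in enumerate(req):    # commit the resources
                    free[d] -= q
                placement[i] = v
                break
    return placement
```

Cost-aware RL approaches like CARL aim to beat such heuristics by packing tighter (fewer active VMs) under the same per-request, online constraint.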
"ByteTuning: Watermark Tuning for RoCEv2"
Lizhuang Tan; Zhuo Jiang; Kefei Liu; Haoran Wei; Pengfei Huo; Huiling Shi; Wei Zhang; Wei Su
IEEE Transactions on Cloud Computing, vol. 13, no. 1, pp. 303-320, 2025-01-03. DOI: 10.1109/TCC.2025.3525496

Abstract: RDMA over Converged Ethernet v2 (RoCEv2) is one of the most popular high-speed datacenter networking solutions. Watermark is the general term for the various trigger and release thresholds of RoCEv2 flow-control protocols, and its reasonable configuration is an important factor affecting RoCEv2 performance. In this paper, we propose ByteTuning, a centralized watermark tuning system for RoCEv2. First, three real cases of network performance degradation caused by non-optimal or improper watermark configuration are reported, and the network performance of different watermark configurations in three typical scenarios is traversed, indicating the necessity of watermark tuning. Then, based on the RDMA Fluid model, the influence of the watermark on RoCEv2 performance is modeled and evaluated. Next, the design of ByteTuning is introduced, which includes three mechanisms: 1) using a simulated annealing algorithm to make the real-time watermark converge to a near-optimal configuration, 2) using network telemetry to reduce feedback overhead, and 3) compressing the search space to improve tuning efficiency. Finally, we validate the performance of ByteTuning in multiple real datacenter networking environments, and the results show that ByteTuning outperforms existing solutions.
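Mechanism 1, simulated-annealing convergence toward a near-optimal watermark, follows the standard SA template. A generic sketch (the real system scores candidate watermarks against live telemetry rather than a synthetic cost function; all names here are illustrative):

```python
import math
import random

def simulated_annealing(cost, init, neighbor, t0=1.0, cooling=0.95,
                        steps=500, seed=0):
    """Generic SA loop: always accept improvements, accept regressions
    with probability exp(-delta / temperature), cool geometrically."""
    rng = random.Random(seed)
    x, fx = init, cost(init)
    best, fbest = x, fx
    t = t0
    for _ in range(steps):
        y = neighbor(x, rng)
        fy = cost(y)
        if fy < fx or rng.random() < math.exp((fx - fy) / t):
            x, fx = y, fy              # move to the candidate
            if fx < fbest:
                best, fbest = x, fx    # track the best configuration seen
        t *= cooling
    return best, fbest
```

For example, tuning a single threshold with a toy cost `abs(w - 50)` standing in for measured network performance, the walk descends toward the optimum as the temperature drops.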
"Cloud-Edge Collaborative Service Architecture With Large-Tiny Models Based on Deep Reinforcement Learning"
Xiaofeng Ji; Faming Gong; Nuanlai Wang; Junjie Xu; Xing Yan
IEEE Transactions on Cloud Computing, vol. 13, no. 1, pp. 288-302, 2025-01-02. DOI: 10.1109/TCC.2024.3525076

Abstract: Offshore drilling platforms (ODPs) are critical infrastructure for exploring and developing marine oil and gas resources. As these platforms' capabilities expand, deploying intelligent surveillance services to ensure safe production has become increasingly important. However, the unique geographical locations and harsh environmental conditions of ODPs pose significant challenges for processing large volumes of video data, complicating the implementation of efficient surveillance systems. This study proposes a Cloud-Edge Large-Tiny Model Collaborative (CELTC) architecture grounded in deep reinforcement learning to optimize the processing and decision-making of surveillance data in offshore drilling platform scenarios. The CELTC architecture leverages edge-cloud computing, deploying complex, high-precision large models on cloud servers and lightweight tiny models on edge devices. This dual deployment strategy capitalizes on tiny models' rapid response and large cloud models' high-precision capabilities. Additionally, the architecture integrates a deep reinforcement learning algorithm designed to optimize the scheduling and offloading of computational tasks between large and tiny models in the cloud-edge environment. The efficacy of the proposed architecture is validated using real-world surveillance data from ODPs through simulations and comparative experiments.
"Efficient Online Computing Offloading for Budget-Constrained Cloud-Edge Collaborative Video Streaming Systems"
Shijing Yuan; Yuxin Liu; Song Guo; Jie Li; Hongyang Chen; Chentao Wu; Yang Yang
IEEE Transactions on Cloud Computing, vol. 13, no. 1, pp. 273-287, 2024-12-30. DOI: 10.1109/TCC.2024.3524310

Abstract: Cloud-Edge Collaborative Architecture (CEA) is a prominent framework that provides low-latency and energy-efficient solutions for video stream processing. In Cloud-Edge Collaborative Video Streaming Systems (CEAVS), efficient online offloading strategies for video tasks are crucial for enhancing user experience. However, most existing works overlook budget constraints, which limits their applicability in real-world scenarios constrained by finite resources. Moreover, they fail to adequately address the heterogeneity of video task redundancies, leading to suboptimal utilization of CEAVS's limited resources. To bridge these gaps, we propose an Efficient Online Computing framework for CEAVS (EOCA) that jointly optimizes accuracy, energy consumption, and latency through adaptive online offloading and redundancy compression, without requiring future task information. Technically, we formulate computing offloading and adaptive compression under budget constraints as a stochastic optimization problem that maximizes system satisfaction, defined as a weighted combination of accuracy, latency, and energy performance. We employ Lyapunov optimization to decouple the long-term budget constraint. We prove that the decoupled problem is a generalized ordinal potential game and propose algorithms based on generalized Benders decomposition (GBD) and the best response to obtain Nash equilibrium strategies for computing offloading and task compression. Finally, we analyze EOCA's performance bound, convergence rate, and worst-case performance guarantees. Evaluations demonstrate that EOCA improves system satisfaction while effectively balancing satisfaction and computational overhead.
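The Lyapunov decoupling of a long-term budget can be illustrated with the standard virtual-queue / drift-plus-penalty template. This is a generic sketch of that technique, not EOCA's full game-theoretic formulation; the names and the discrete action set are ours.

```python
def virtual_queue_update(q, spend, budget_per_slot):
    """Virtual queue for a long-term budget constraint: grows when a slot
    overspends its average budget share, drains when it underspends.
    Keeping q stable enforces the budget on average."""
    return max(q + spend - budget_per_slot, 0.0)

def drift_plus_penalty_choice(actions, q, V):
    """Pick the action minimizing  -V * satisfaction + q * spend,
    trading immediate satisfaction against the budget-queue backlog.
    actions: list of (satisfaction, spend) pairs; V weights satisfaction."""
    return min(range(len(actions)),
               key=lambda i: -V * actions[i][0] + q * actions[i][1])
```

With an empty queue the controller chases satisfaction; as the backlog grows, cheaper actions win, steering spending back under the budget.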
"Differentially Private and Truthful Reverse Auction With Dynamic Resource Provisioning for VNFI Procurement in NFV Markets"
Xueyi Wang; Xingwei Wang; Zhitong Wang; Rongfei Zeng; Ruiyun Yu; Qiang He; Min Huang
IEEE Transactions on Cloud Computing, vol. 13, no. 1, pp. 259-272, 2024-12-26. DOI: 10.1109/TCC.2024.3522963

Abstract: With the advent of network function virtualization (NFV), many users resort to network service provisioning through virtual network function instances (VNFIs) run on standard physical servers in clouds. Following this trend, NFV markets are emerging, which allow a user to procure VNFIs from cloud service providers (CSPs). In such a procurement process, it is a significant challenge to ensure differential privacy and truthfulness while explicitly considering the dynamic resource provisioning, location sensitivity and budget of each VNFI. We therefore design a differentially private and truthful reverse auction with dynamic resource provisioning (PTRA-DRP) to resolve the VNFI procurement (VNFIP) problem. To allow dynamic resource provisioning, PTRA-DRP enables CSPs to submit a set of bids and have as many of them accepted as possible, and determines the VNFIs to provision based on the auction outcomes. Specifically, we first devise a greedy heuristic to select the set of winning bids in a differentially private manner. Next, we design a pricing strategy to compute the charges of CSPs, aiming to guarantee truthfulness. Rigorous theoretical analysis proves that PTRA-DRP ensures differential privacy, truthfulness, individual rationality, computational efficiency and approximate social cost minimization. Extensive simulations also demonstrate the effectiveness and efficiency of PTRA-DRP.
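Differentially private selection steps of this kind are typically built on the exponential mechanism. A standard, numerically stable sketch follows (not PTRA-DRP's exact greedy rule; the single-winner setting, scores, and sensitivity here are illustrative):

```python
import math
import random

def exponential_mechanism(candidates, score, epsilon, sensitivity=1.0, rng=None):
    """Pick one candidate with probability proportional to
    exp(epsilon * score / (2 * sensitivity)); higher scores are more likely.
    Scores are max-shifted before exponentiation for numerical stability."""
    rng = rng or random.Random()
    scores = [score(c) for c in candidates]
    m = max(scores)
    weights = [math.exp(epsilon * (s - m) / (2 * sensitivity)) for s in scores]
    r = rng.random() * sum(weights)
    acc = 0.0
    for c, w in zip(candidates, weights):
        acc += w
        if r <= acc:
            return c
    return candidates[-1]
```

With a small privacy budget epsilon the choice is nearly uniform (strong privacy); with a large epsilon it concentrates on the best-scoring bid, which is how such auctions trade privacy against social cost.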
"SROdcn: Scalable and Reconfigurable Optical DCN Architecture for High-Performance Computing"
Kassahun Geresu; Huaxi Gu; Xiaoshan Yu; Meaad Fadhel; Hui Tian; Wenting Wei
IEEE Transactions on Cloud Computing, vol. 13, no. 1, pp. 245-258, 2024-12-26. DOI: 10.1109/TCC.2024.3523433

Abstract: Data Center Network (DCN) flexibility is critical for providing adaptive and dynamic bandwidth while optimizing network resources to manage the variable traffic patterns generated by heterogeneous applications. To provide flexible bandwidth, this work proposes a machine-learning approach together with a new Scalable and Reconfigurable Optical DCN (SROdcn) architecture that handles dynamic and non-uniform network traffic according to the scale of the high-performance optical interconnected DCN. Our main device is the Fiber Optical Switch (FOS), which offers competitive wavelength resolution. We propose a new top-of-rack (ToR) switch that utilizes Wavelength Selective Switches (WSS) to investigate Software-Defined Networking (SDN) with machine-learning-enabled flow prediction for reconfigurable optical DCNs. Our architecture provides highly scalable and flexible bandwidth allocation. Results from Mininet experimental simulations demonstrate that, under the management of an SDN controller, machine-learning traffic-flow prediction and graph connectivity allow each optical bandwidth to be automatically reconfigured according to variable traffic patterns. The average server-to-server packet delay of the reconfigurable SROdcn improves by 42.33% compared to inflexible interconnects. Furthermore, the network performance of flexible SROdcn servers shows up to a 49.67% latency improvement over the Passive Optical Data Center Architecture (PODCA), a 16.87% latency improvement over the optical OPSquare DCN, and up to a 71.13% latency improvement over the fat-tree network. Additionally, our optimized Unsupervised Machine Learning (ML-UnS) method for SROdcn outperforms Supervised Machine Learning (ML-S) and Deep Learning (DL).
"Enhancing the Availability and Security of Attestation Scheme for Multiparty-Involved DLaaS: A Circular Approach"
Miaomiao Yang; Guosheng Huang; Honghai Chen; Yongyi Liao; Qixu Wang; Xingshu Chen
IEEE Transactions on Cloud Computing, vol. 13, no. 1, pp. 227-244, 2024-12-26. DOI: 10.1109/TCC.2024.3522993

Abstract: In this paper, we propose a remote attestation approach based on multiple verifiers, named CARE. CARE aims to enhance the practicality and efficiency of remote attestation while addressing trust issues in environments involving multiple stakeholders. Specifically, CARE adopts the concept of swarm verification and employs a circular collaboration model with multiple verifiers to collect and validate evidence, thereby resolving trust issues and enhancing verification efficiency. Moreover, CARE introduces a meticulously designed filtering mechanism that non-invasively addresses false positives in verification outcomes. CARE utilizes a multiway tree structure to construct the baseline value library, which enhances the flexibility and fine-grained management capability of the system. Security analysis indicates that CARE can effectively resist collusion attacks. Further, detailed simulation experiments have validated its capability to convincingly attest to the trustworthiness of dynamically constructed environments. Notably, CARE is also suitable for the remote attestation of large-scale virtual machines, achieving an efficiency nine times that of the classical approach. To the best of our knowledge, CARE is the first practical solution to address inaccuracies in remote attestation results caused by the activation of the Integrity Measurement Architecture (IMA) at the application layer.
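A multiway-tree baseline library can be pictured as a path-keyed tree with expected measurement digests at the leaves. This is a simplified illustration of the idea only; CARE's actual structure, and its filtering mechanism, are richer than this sketch, and all names here are ours.

```python
class BaselineTree:
    """Multiway tree keyed by path components; leaves hold expected
    measurement digests, so whole subtrees (e.g. one VM's files) can be
    managed and looked up independently."""
    def __init__(self):
        self.root = {}

    def add(self, path, digest):
        """Insert the expected digest for a file path."""
        node = self.root
        parts = path.strip("/").split("/")
        for p in parts[:-1]:
            node = node.setdefault(p, {})
        node[parts[-1]] = digest

    def verify(self, path, digest):
        """Check a reported measurement against the baseline; unknown
        paths fail closed."""
        node = self.root
        for p in path.strip("/").split("/"):
            if not isinstance(node, dict) or p not in node:
                return False
            node = node[p]
        return node == digest
```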