IEEE Transactions on Circuits and Systems for Video Technology: Latest Articles

An Online-Training-Free Adaptor for Open Heterogeneous Collaborative Perception via Diffusion Model
IF 11.1, CAS Tier 1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2026-04-01 Epub Date: 2025-11-04 DOI: 10.1109/TCSVT.2025.3628726
Tianhang Wang; Fan Lu; Sanqing Qu; Bin Li; Ya Wu; Hu Cao; Alois Knoll; Guang Chen
{"title":"An Online-Training-Free Adaptor for Open Heterogeneous Collaborative Perception via Diffusion Model","authors":"Tianhang Wang;Fan Lu;Sanqing Qu;Bin Li;Ya Wu;Hu Cao;Alois Knoll;Guang Chen","doi":"10.1109/TCSVT.2025.3628726","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3628726","url":null,"abstract":"Collaborative perception seeks to mitigate the limitations of single-vehicle perception, such as occlusions, by facilitating communication and information sharing among connected vehicles. However, most existing works assume a homogeneous scenario where all vehicles share identity sensor types and perception model architectures. In contrast, real-world systems often involve heterogeneous agents with diverse sensor configurations and independently developed models. In such settings, directly exchanging features without proper alignment can significantly degrade performance and hinder effective collaboration. While some methods have been proposed to address heterogeneity, they typically require retraining or access to internal model parameters, making them impractical for scalable deployment. To address these challenges, we propose DiffAlign, a plug-and-play adapter that enables feature alignment across heterogeneous agents in a training-free and model-agnostic manner. DiffAlign treats received BEV features as noisy latent representations and progressively refines them through a pretrained diffusion process. This alignment strategy does not require access to model internals or any retraining, which makes it both scalable and privacy-preserving while supporting diverse sensor modalities and perception backbones. Extensive experiments on simulated OPV2V and real-world V2V4Real datasets demonstrate that DiffAlign consistently improves detection performance in heterogeneous settings, improving CoBEVT by 132.01% and 91.95%, respectively. Our method provides a practical path toward scalable, generalizable, and deployment-ready collaborative perception.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5729-5741"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
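The refinement step described in the abstract, treating a received BEV feature map as a noisy latent and denoising it with a pretrained diffusion model, can be sketched as follows. This is a minimal illustration of the general idea under stated assumptions, not the paper's implementation: the `denoiser(x, t)` noise-prediction callable, the cumulative-product schedule `alpha_bar`, and the starting step are all hypothetical.

```python
import torch

@torch.no_grad()
def diffusion_align(feat, denoiser, alpha_bar, start_t=50):
    # Treat received heterogeneous BEV features (B, C, H, W) as a noisy
    # latent at step `start_t`, then run deterministic (DDIM-style, eta=0)
    # reverse steps back to step 0. `alpha_bar` must have length > start_t.
    x = feat
    for t in range(start_t, 0, -1):
        a_t, a_prev = alpha_bar[t], alpha_bar[t - 1]
        eps = denoiser(x, torch.tensor([t]))                # predicted noise
        x0 = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()      # clean-feature estimate
        x = a_prev.sqrt() * x0 + (1 - a_prev).sqrt() * eps  # step t -> t-1
    return x
```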
ReFHD-Net: A Reversible Functionality Hiding Framework for Deep Neural Networks
IF 11.1, CAS Tier 1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2026-04-01 Epub Date: 2025-11-14 DOI: 10.1109/TCSVT.2025.3632812
Na Wang; Pengpeng Li; Lin Huang; Fang Cao; Xinyi Wang; Chuan Qin
{"title":"ReFHD-Net: A Reversible Functionality Hiding Framework for Deep Neural Networks","authors":"Na Wang;Pengpeng Li;Lin Huang;Fang Cao;Xinyi Wang;Chuan Qin","doi":"10.1109/TCSVT.2025.3632812","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3632812","url":null,"abstract":"With the rapid development of artificial intelligence, deep neural networks (DNN) have become valuable digital assets, thereby highlighting the urgent need for copyright protection and secure transmission. Although traditional model watermarking and active defense techniques offer partial protection against unauthorized use, they often suffer from limited imperceptibility and may degrade model performance. To overcome these challenges, this paper proposes ReFHD-Net, a reversible functionality hiding framework for DNN based on a structured mask matrix. Here, reversible functionality hiding refers to the ability to hide the functionality of secret task within the stego model during transmission and enable its lossless recovery by authorized users at the receiver side. Specifically, ReFHD-Net employs a two-stage strategy to hide the secret functionality within a carrier model. In the first stage, a multi-task learning framework enhanced with homoscedastic uncertainty is employed to jointly train the model on both public and secret tasks. In the second stage, the model parameters are further optimized using a combination of task-driven loss and parameter distribution regularization, which limits parameter deviations caused by the hiding process and enhances the imperceptibility of the secret task. Experimental results on image classification and denoising benchmarks validate the superiority of our ReFHD-Net. It achieves an average degradation of only 0.27% in public task and enables lossless recovery of the secret task with no performance drop. Moreover, our framework exhibits strong robustness and security against various unauthorized recovery attempts including random guessing, fine-tuning, and model pruning.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5683-5695"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
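The first training stage above uses multi-task learning weighted by homoscedastic uncertainty. Below is a minimal sketch of the standard uncertainty-weighted loss that this term usually denotes (Kendall et al.), with a learnable log-variance per task; the two-task setup is an illustrative assumption, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    # total = sum_i exp(-s_i) * L_i + s_i, where s_i = log(sigma_i^2) is a
    # learnable per-task log-variance balancing public and secret tasks.
    def __init__(self, num_tasks=2):
        super().__init__()
        self.log_vars = nn.Parameter(torch.zeros(num_tasks))

    def forward(self, task_losses):
        total = 0.0
        for i, loss in enumerate(task_losses):
            total = total + torch.exp(-self.log_vars[i]) * loss + self.log_vars[i]
        return total
```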
Semantic-Interactive Clustering Optimization With SAM for Weakly Supervised Person Search
IF 11.1, CAS Tier 1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2026-04-01 Epub Date: 2025-11-25 DOI: 10.1109/TCSVT.2025.3636572
Xi Yang; Hexun Zhou; De Cheng; Menghui Tian; Nannan Wang
{"title":"Semantic-Interactive Clustering Optimization With SAM for Weakly Supervised Person Search","authors":"Xi Yang;Hexun Zhou;De Cheng;Menghui Tian;Nannan Wang","doi":"10.1109/TCSVT.2025.3636572","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3636572","url":null,"abstract":"Weakly-supervised person search presents significant challenges when relying solely on bounding-box annotations, particularly due to inter-class confusion from clothing similarity and intra-class variations caused by illumination changes, which severely degrade cross-view matching accuracy. Existing clustering-based methods, constrained by their heavy dependence on color features, frequently produce unreliable pseudo-labels that ultimately limit model performance. To overcome these limitations, we present Segment Anything Model-based Semantic-Interactive Clustering Optimization (SAM-SICO), a novel framework that integrates the Segment Anything Model’s semantic segmentation capability with adaptive clustering optimization for weakly-supervised person search. Our framework harnesses the representational power of the Segment Anything Model (SAM) to enable detector-free semantic feature learning while significantly improving clustering precision. The proposed solution makes three key advances: the Semantic Contour Embedding (SCE) module leverages SAM’s zero-shot segmentation capability to produce highly accurate human body masks; the Relation-driven Semantic Feature Interaction (RSFI) mechanism effectively mitigates clothing-color bias through innovative dynamic affinity matrix construction across multiscale semantic masks and visual features; and the Adaptive Clustering Optimization (ACO) algorithm introduces parameter adaptation to optimize intra-class compactness and inter-class separation metrics. Experimental results show that our method outperforms existing state-of-the-art approaches on the PRW and CUHK-SYSU datasets. The source code is available at <uri>https://github.com//HawlsonZ/SAM-SICO</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5642-5654"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
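The RSFI mechanism above builds dynamic affinity matrices across semantic masks and visual features. One plausible ingredient, sketched under simplifying assumptions (single scale, binary SAM masks, cosine affinity; not the paper's construction), is to pool backbone features under each mask and compare the pooled descriptors pairwise.

```python
import torch
import torch.nn.functional as F

def mask_pooled_affinity(feat_map, masks):
    # feat_map: (C, H, W) backbone features; masks: (N, H, W) binary SAM masks.
    m = masks.float().flatten(1)                   # (N, H*W)
    f = feat_map.flatten(1)                        # (C, H*W)
    area = m.sum(dim=1, keepdim=True).clamp(min=1)
    pooled = (m @ f.t()) / area                    # (N, C) mask-averaged features
    pooled = F.normalize(pooled, dim=1)
    return pooled @ pooled.t()                     # (N, N) cosine affinity matrix
```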
Pseudocylindrical Convolutions for Learned Omnidirectional Image Compression
IF 11.1, CAS Tier 1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2026-04-01 Epub Date: 2025-11-27 DOI: 10.1109/TCSVT.2025.3638018
Mu Li; Kede Ma; Jinxing Li; David Zhang
{"title":"Pseudocylindrical Convolutions for Learned Omnidirectional Image Compression","authors":"Mu Li;Kede Ma;Jinxing Li;David Zhang","doi":"10.1109/TCSVT.2025.3638018","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3638018","url":null,"abstract":"Equirectangular projection (ERP) is a convenient form to store omnidirectional images, but it is neither equal-area nor conformal, creating challenges for subsequent visual communication. When used for image compression, ERP amplifies sampling density and deforms objects near the poles, hindering perceptually optimal bit allocation. Here, we present one of the earliest endeavors to apply deep neural networks to omnidirectional image compression. We first propose parametric pseudocylindrical representations that generalize common pseudocylindrical map projections. A tractable greedy algorithm is introduced to identify (sub-)optimal representation configurations, guided by a proxy objective for rate-distortion performance. We then develop pseudocylindrical convolutions, which can be efficiently implemented by standard convolutions with “pseudocylindrical padding.” To demonstrate the utility of the proposed pseudocylindrical representations and convolutions, we implement an end-to-end omnidirectional image compression method, consisting of an analysis transform, a uniform quantizer, a synthesis transform, and an entropy model. Experiments show that our optimized method achieves consistently better rate-distortion performance compared to the state-of-the-art.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5497-5509"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
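The "pseudocylindrical padding" mentioned above lets plain convolutions respect the horizontal periodicity of panoramic data. Here is a toy sketch on an ERP-like grid, wrapping circularly in longitude and replicating at the poles; the paper applies row-dependent padding to a learned pseudocylindrical representation, which this simplification does not reproduce.

```python
import torch
import torch.nn.functional as F

def cylindrical_pad(x, pad=1):
    # x: (B, C, H, W). Longitude (W) is periodic, so wrap around;
    # replicate at the top/bottom borders (the poles) along H.
    x = torch.cat([x[..., -pad:], x, x[..., :pad]], dim=-1)
    return F.pad(x, (0, 0, pad, pad), mode="replicate")

conv = torch.nn.Conv2d(64, 64, kernel_size=3, padding=0)  # padding done manually
y = conv(cylindrical_pad(torch.randn(1, 64, 32, 64)))     # output: (1, 64, 32, 64)
```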
Mining Temporal Redundancy Using Long Short-Term Motion Aggregation and Global-Local Decorrelation for Learned Video Compression
IF 11.1, CAS Tier 1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2026-04-01 Epub Date: 2025-11-27 DOI: 10.1109/TCSVT.2025.3638161
Feng Yuan; Zhaoqing Pan; Jianjun Lei; Bo Peng; Haoran Xie; Fu Lee Wang; Sam Kwong
{"title":"Mining Temporal Redundancy Using Long Short-Term Motion Aggregation and Global–Local Decorrelation for Learned Video Compression","authors":"Feng Yuan;Zhaoqing Pan;Jianjun Lei;Bo Peng;Haoran Xie;Fu Lee Wang;Sam Kwong","doi":"10.1109/TCSVT.2025.3638161","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3638161","url":null,"abstract":"The conditional coding paradigm is widely used in learned video compression, which shows superior performance in capturing redundancies within a large context space. However, existing Conditional coding-based Learned Video Compression (C-LVC) methods ignore that the predicted motion vectors usually contain large uncertainty due to complex motions, occlusions, etc., which consequently decrease the accuracy of the generated temporal contexts. In addition, existing C-LVC methods have a weak ability to mine diverse dependencies within the context space, which are closely related to the coding efficiency. To address these issues, an efficient temporal redundancy mining method is proposed to improve the coding efficiency of C-LVC in this paper. To generate accurate temporal contexts, a Long Short-Term Motion Aggregation (LSTMA) model is proposed, in which an LSTMA-based motion estimation module is developed to capture both current and aggregated long short-term motion information to reduce the uncertainty of predicted motion vectors. Based on the dual motion information, an LSTMA-based temporal context mining module is developed to exploit the aggregated long short-term motion information and increase the accuracy of the generated temporal contexts. In order to fully eliminate spatial-temporal redundancies in a video, a Global-Local Information Decorrelation Module (GLIDM)-based context codec is proposed, in which the GLIDM is designed based on the visual state space block (namely vmamba), the residual block, and the squeeze-and-excitation block to effectively capture long-range, short-range spatial-temporal dependencies and channel-wise dependencies. Experimental results demonstrate that our proposed method can effectively improve the coding performance of C-LVC, and outperforms other state-of-the-art LVC methods.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5510-5524"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620936","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
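Of the three building blocks named for GLIDM, the squeeze-and-excitation block is a standard component and can be sketched directly; the channel count and reduction ratio below are arbitrary placeholders, and the vmamba and residual blocks are not reproduced.

```python
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze (global average pool) -> excite (bottleneck MLP) -> rescale
    # channels, capturing the channel-wise dependencies mentioned above.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.gate = nn.Sequential(
            nn.AdaptiveAvgPool2d(1),
            nn.Conv2d(channels, channels // reduction, 1),
            nn.ReLU(inplace=True),
            nn.Conv2d(channels // reduction, channels, 1),
            nn.Sigmoid(),
        )

    def forward(self, x):
        return x * self.gate(x)
```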
Syntax Element Encryption for H.265/HEVC Using Chaotic Map-Based Coefficient Scrambling Scheme
IF 11.1, CAS Tier 1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2026-04-01 Epub Date: 2025-10-23 DOI: 10.1109/TCSVT.2025.3625077
Liang-Wei Li; Chung-Nan Lee; Kishu Gupta; Huei-Fang Yang; Ashutosh Kumar Singh
{"title":"Syntax Element Encryption for H.265/HEVC Using Chaotic Map-Based Coefficient Scrambling Scheme","authors":"Liang-Wei Li;Chung-Nan Lee;Kishu Gupta;Huei-Fang Yang;Ashutosh Kumar Singh","doi":"10.1109/TCSVT.2025.3625077","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3625077","url":null,"abstract":"In today’s digital landscape, high-efficiency video coding (H.265/HEVC) has emerged as the most widely used video coding standard, employing selective encryption schemes to protect the privacy of video content while maintaining efficient compression performance. However, existing coefficient scrambling methods impose a significant computational load, leading to increased bit rate overhead due to encryption, longer execution times, and insufficient safety measures. To address these issues, a new coefficient scrambling scheme based on <italic>chaotic maps</i> is proposed. This approach leverages the pseudorandomness, ergodicity, and sensitivity to initial conditions inherent in chaotic maps to generate highly unpredictable coefficient distributions, thereby strengthening security while preserving low complexity. Unlike conventional scrambling, chaotic maps ensure minimal correlation between encrypted coefficients, enhancing resistance against statistical and differential attacks. Additionally, the scrambling conditions are specifically designed to minimize the impact on the bit rate overhead. Furthermore, when combined with syntax element encryption (SEC), which includes motion vector difference (MVD), quantized transform coefficients (QTC), and luma intraprediction mode (Luma IPM), this method effectively distorts video content. The proposed scheme operates synchronously with slices, ensuring that the decryption of video content remains intact even if some slices are lost. Additionally, a random sequence generated by AES-CTR is incorporated with the H.265 encoded stream to protect against chosen-plaintext attacks. The experimental results indicate that this scheme features high security, compliance with format standards, fast execution times, synchronous updates with slices, and resilience against common attacks, all while achieving a reduced bit rate overhead of 45.13% with a lowered average execution time overhead of 1.91%.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5655-5670"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
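A generic illustration of chaotic-map coefficient scrambling of the kind the abstract describes: a logistic-map orbit keyed by (x0, r) drives a keyed permutation, and regenerating the orbit from the same key recovers it. This shows only the scrambling primitive; the paper's HEVC syntax-element integration, slice synchronization, and bit-rate-aware scrambling conditions are not reproduced.

```python
import numpy as np

def logistic_permutation(n, x0=0.6180339887, r=3.99):
    # Iterate the logistic map x <- r * x * (1 - x); (x0, r) is the key.
    xs, x = np.empty(n), x0
    for i in range(n):
        x = r * x * (1.0 - x)
        xs[i] = x
    return np.argsort(xs)  # permutation induced by the chaotic orbit

def scramble(coeffs, key=(0.6180339887, 3.99)):
    perm = logistic_permutation(len(coeffs), *key)
    return coeffs[perm]

def unscramble(scrambled, key=(0.6180339887, 3.99)):
    perm = logistic_permutation(len(scrambled), *key)
    out = np.empty_like(scrambled)
    out[perm] = scrambled  # invert the keyed permutation
    return out
```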
Steganography via Neural Network Parameter Initialization With High Fidelity and Imperceptibility
IF 11.1, CAS Tier 1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2026-04-01 Epub Date: 2025-11-19 DOI: 10.1109/TCSVT.2025.3634666
Na Wang; Chenyi Xu; Fang Cao; Lin Huang; Wei Wang; Chuan Qin
{"title":"Steganography via Neural Network Parameter Initialization With High Fidelity and Imperceptibility","authors":"Na Wang;Chenyi Xu;Fang Cao;Lin Huang;Wei Wang;Chuan Qin","doi":"10.1109/TCSVT.2025.3634666","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3634666","url":null,"abstract":"In recent years, with the rapid advancement of deep neural networks (DNNs), researchers have explored steganography techniques that use DNN models as carriers for secret information hiding. However, existing methods generally suffer from limited imperceptibility and fidelity. To address these limitations, this paper proposes a steganography method that achieves high imperceptibility and fidelity while providing substantial embedding capacity and robustness. Specifically, we introduce a dual-branch encoder that embeds secret information into the initialization parameters of the cover model with almost no degradation of the model’s functionality. In addition, a new SMSE loss is employed to constrain the encoder output, which enhances the imperceptibility of the stego model. After training and transmission, the receiver can utilize a decoder to accurately extract secret information from the stego model. Experimental results demonstrate that the proposed method achieves a Kullback–Leibler (KL) divergence more than an order of magnitude lower than existing methods, with values ranging from 0.0003 to 0.007. The stego model preserves high fidelity to the original model, with classification accuracy differences within 0.005 on benchmark datasets including MNIST, CIFAR-10, and SST-2. In terms of embedding capacity, it achieves 319,312 bits on ResNet-18 and 1,757,952 bits on ViT, which exceeds the performance of baseline methods across most models. Furthermore, the proposed method exhibits strong robustness, as the embedded information can still be accurately recovered with BCH coding even under noise attacks at an SNR as low as -6 dB. It also demonstrates strong generalization, as it performs effectively on both classification networks and generative or reconstruction models such as GANs, VAEs, and U-Nets.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5696-5713"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
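As a toy illustration of the general setting above, hiding bits in model parameters, one can force the parity of quantized weight values to carry the payload. This naive LSB-style scheme is for intuition only; the paper's dual-branch encoder, SMSE loss, and BCH-coded robustness are not reproduced, and `step` and the choice of carrier weights are arbitrary assumptions.

```python
import numpy as np

def embed_bits(weights, bits, step=1e-4):
    # Quantize the first len(bits) weights to multiples of `step` and flip
    # the parity of each quantized level to match the corresponding bit.
    bits = np.asarray(bits, dtype=np.int64)
    q = np.round(weights[:len(bits)] / step).astype(np.int64)
    q += (q % 2) ^ bits                    # +1 only where parity disagrees
    out = weights.copy()
    out[:len(bits)] = q * step
    return out

def extract_bits(weights, n, step=1e-4):
    # Recover the payload from the parity of the quantized levels.
    return (np.round(weights[:n] / step).astype(np.int64) % 2).astype(np.uint8)
```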
ERDDCI: Exact Reversible Diffusion via Dual-Chain Inversion for High-Quality Image Editing
IF 11.1, CAS Tier 1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2026-04-01 Epub Date: 2025-11-28 DOI: 10.1109/TCSVT.2025.3638406
Jimin Dai; Yingzhen Zhang; Shuo Chen; Jian Yang; Lei Luo
{"title":"ERDDCI: Exact Reversible Diffusion via Dual-Chain Inversion for High-Quality Image Editing","authors":"Jimin Dai;Yingzhen Zhang;Shuo Chen;Jian Yang;Lei Luo","doi":"10.1109/TCSVT.2025.3638406","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3638406","url":null,"abstract":"Diffusion models (DMs) have been successfully applied to real image editing. These models typically invert images into latent noise vectors during the inversion process, and then edit them during the inference process. However, DMs often rely on the local linearization assumption, which assumes that the noise injected during the inversion process approximates the noise removed during the inference process. While DMs efficiently generate images under this assumption, it also accumulates errors during the diffusion process due to the assumption, ultimately negatively impacting the quality of real image reconstruction and editing. To address this issue, we propose a novel ERDDCI (Exact Reversible Diffusion via Dual-Chain Inversion). ERDDCI uses the new Dual-Chain Inversion (DCI) for joint inference to derive an exact reversible diffusion process. Using DCI, our method avoids the cumbersome optimization process in existing inversion approaches and achieves high-quality image editing. Additionally, to accommodate image operations under high guidance scales, we introduce a dynamic control strategy that enables more refined image reconstruction and editing. Our experiments demonstrate that ERDDCI significantly outperforms state-of-the-art methods in a 50-step diffusion process. It achieves rapid and precise image reconstruction with SSIM of 0.999 and LPIPS of 0.001, and delivers competitive results in image editing. The source code is available at: <uri>https://github.com/daii-y/ERDDCI</uri>","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5437-5452"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620941","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
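For context, the standard DDIM inversion that ERDDCI improves upon runs the deterministic sampler in reverse under exactly the local-linearization assumption criticized above. A minimal sketch, assuming a hypothetical noise-prediction model `eps_model(x, t)` and a cumulative schedule `alpha_bar` of length T + 1:

```python
import torch

@torch.no_grad()
def ddim_invert(x0, eps_model, alpha_bar, T=50):
    # Map a real image latent x0 to a noise vector by reversing the
    # deterministic DDIM update; it uses eps(x_t, t) in place of the unknown
    # eps(x_{t+1}, t), i.e., the local linearization that accumulates error.
    x = x0
    for t in range(T):
        a_t, a_next = alpha_bar[t], alpha_bar[t + 1]
        eps = eps_model(x, torch.tensor([t]))
        x0_hat = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_hat + (1 - a_next).sqrt() * eps
    return x  # approximate latent noise; editing then re-runs the forward sampler
```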
Multi-Modal Generative AI: Multi-Modal LLMs, Diffusions, and the Unification
IF 11.1, CAS Tier 1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2026-04-01 Epub Date: 2025-11-20 DOI: 10.1109/TCSVT.2025.3635224
Xin Wang; Yuwei Zhou; Bin Huang; Hong Chen; Wenwu Zhu
{"title":"Multi-Modal Generative AI: Multi-Modal LLMs, Diffusions, and the Unification","authors":"Xin Wang;Yuwei Zhou;Bin Huang;Hong Chen;Wenwu Zhu","doi":"10.1109/TCSVT.2025.3635224","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3635224","url":null,"abstract":"Multi-modal generative AI (Artificial Intelligence) has attracted increasing attention from both academia and industry. Particularly, two dominant families of techniques have emerged: i) Multi-modal large language models (LLMs) demonstrate impressive ability for <italic>multi-modal understanding</i>; and ii) Diffusion models exhibit remarkable multi-modal powers in terms of <italic>multi-modal generation</i>. Therefore, this paper provides a comprehensive overview of multi-modal generative AI, including multi-modal LLMs, diffusions, and the unification for understanding and generation. To lay a solid foundation for unified models, we first provide a detailed review of both multi-modal LLMs and diffusion models, respectively, including their probabilistic modeling procedure, multi-modal architecture design, and advanced applications to image/video LLMs as well as text-to-image/video generation. Furthermore, we explore the emerging efforts toward unified models for understanding and generation. To achieve the unification of understanding and generation, we investigate key designs including autoregressive-based and diffusion-based modeling, as well as dense and Mixture-of-Experts (MoE) architectures. We then introduce several strategies for unified models, analyzing their potential advantages and disadvantages. In addition, we summarize the common datasets widely used for multi-modal generative AI pretraining. Last but not least, we present several challenging future research directions that may contribute to the ongoing advancement of multi-modal generative AI.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5621-5641"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
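Among the architectural options the survey contrasts for unified models, dense versus Mixture-of-Experts, a top-k MoE layer can be sketched in a few lines. Expert count, width, and routing below are illustrative defaults, not tied to any specific model in the survey.

```python
import torch
import torch.nn as nn

class TopKMoE(nn.Module):
    # Each token is routed to its top-k experts; outputs are combined with
    # renormalized gate weights, so only k of num_experts FFNs run per token.
    def __init__(self, dim, num_experts=4, k=2):
        super().__init__()
        self.router = nn.Linear(dim, num_experts)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
             for _ in range(num_experts)])
        self.k = k

    def forward(self, x):                        # x: (num_tokens, dim)
        gates = self.router(x).softmax(dim=-1)
        w, idx = gates.topk(self.k, dim=-1)      # per-token top-k experts
        w = w / w.sum(dim=-1, keepdim=True)
        out = torch.zeros_like(x)
        for j in range(self.k):
            for e, expert in enumerate(self.experts):
                sel = idx[:, j] == e             # tokens whose j-th choice is e
                if sel.any():
                    out[sel] += w[sel, j:j + 1] * expert(x[sel])
        return out
```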
DeskPred: Two-Stage Video Stream Bandwidth Prediction for Cold-Start and Training Forgetting in Cloud Desktops
IF 11.1, CAS Tier 1, Engineering & Technology
IEEE Transactions on Circuits and Systems for Video Technology Pub Date: 2026-04-01 Epub Date: 2025-11-12 DOI: 10.1109/TCSVT.2025.3632222
Zuodong Jin; Dan Tao; Peng Qi; Jiayu Zhang; Gang Han; Ruipeng Gao
{"title":"DeskPred: Two-Stage Video Stream Bandwidth Prediction for Cold-Start and Training Forgetting in Cloud Desktops","authors":"Zuodong Jin;Dan Tao;Peng Qi;Jiayu Zhang;Gang Han;Ruipeng Gao","doi":"10.1109/TCSVT.2025.3632222","DOIUrl":"https://doi.org/10.1109/TCSVT.2025.3632222","url":null,"abstract":"As a cloud-hosted virtual desktop service, cloud desktop supports various fields such as telecommuting, collaborative development, while enabling real-time user interaction through video stream. The stability of this process is determined by bandwidth, which significantly influences the user experience. Therefore, precise bandwidth prediction of video streams is essential in cloud desktops. This work proposes DeskPred for video stream transmission in cloud desktops, focusing on dynamic bandwidth prediction. In the startup stage, the limited data amount poses a challenge for achieving precise bandwidth predictions. We propose an Affinity-based Federated Learning algorithm, which leverages the historical records of high-affinity users for assisted training, all while protecting user privacy. During the long-term adjustment stage, we propose a Fluctuation-based Adaptive Incremental Prediction algorithm for independent training to address the issue of pattern forgetting. The algorithm considers both periodic features and instantaneous features, incorporating new patterns while revisiting previous knowledge through the memory module and Adversarial Elastic Weight Consolidation. We have verified DeskPred through an actual cloud desktop project supported by Lenovo Research. Through experiments conducted on a total of over 18 million data items (approximately 10 GB), DeskPred achieves the highest total score of 71.11%, making it highly suitable for cloud desktop environments.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"36 4","pages":"5595-5607"},"PeriodicalIF":11.1,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147620927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
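DeskPred's long-term stage revisits old traffic patterns through Elastic Weight Consolidation. Below is a sketch of the vanilla EWC penalty that the paper's "Adversarial Elastic Weight Consolidation" builds on (the adversarial component and memory module are not reproduced); `fisher` and `old_params` are assumed to be precomputed dicts keyed by parameter name.

```python
import torch

def ewc_penalty(model, fisher, old_params, lam=100.0):
    # Quadratic penalty that anchors parameters important to previously
    # learned bandwidth patterns: sum_i F_i * (theta_i - theta_i_old)^2.
    loss = 0.0
    for name, p in model.named_parameters():
        if name in fisher:
            loss = loss + (fisher[name] * (p - old_params[name]) ** 2).sum()
    return 0.5 * lam * loss
```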