Latest Articles in IEEE Transactions on Multimedia

Screen Detection from Egocentric Image Streams Leveraging Multi-View Vision Language Model.
IF 9.7 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2026-02-10 | DOI: 10.1109/tmm.2026.3660180
Xueshen Li, Sen Shen, Xinlong Hou, Xinran Gao, Ziyi Huang, Steven J Holiday, Matthew R Cribbet, Susan W White, Edward Sazonov, Yu Gan
{"title":"Screen Detection from Egocentric Image Streams Leveraging Multi-View Vision Language Model.","authors":"Xueshen Li, Sen Shen, Xinlong Hou, Xinran Gao, Ziyi Huang, Steven J Holiday, Matthew R Cribbet, Susan W White, Edward Sazonov, Yu Gan","doi":"10.1109/tmm.2026.3660180","DOIUrl":"10.1109/tmm.2026.3660180","url":null,"abstract":"<p><p>Accurately monitoring the screen exposure of young children is important for research related to screen use, such as childhood obesity, physical activity, and social interaction. Most existing studies rely upon self-report or manual measures from bulky wearable sensors, thus lacking efficiency and accuracy in capturing quantitative screen exposure data. In this work, we developed a novel screen detection framework that utilizes egocentric images from a wearable sensor, named the screen time tracker (STT), and a vision language model (VLM). In particular, we devised a multi-view VLM that takes multiple views from egocentric image streams and interprets screen exposure dynamically. We validated our approach by using a dataset of children's free-living activities, demonstrating significant improvement over existing methods in conventional vision language models and object detection models. The combination of a vision language model and a lightweight hardware design provides a novel solution in screen detection for children. The proposed framework has great potential to benefit children's behavioral study. The code is available at https://github.com/YGanLab/MV-VLM.</p>","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":" ","pages":""},"PeriodicalIF":9.7,"publicationDate":"2026-02-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12893618/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146179422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
TMT: Tri-Modal Translation Between Speech, Image, and Text by Processing Different Modalities as Different Languages
IF 9.7 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2026-01-29 | DOI: 10.1109/TMM.2026.3659297
Minsu Kim;Jee-weon Jung;Hyeongseop Rha;Soumi Maiti;Siddhant Arora;Xuankai Chang;Shinji Watanabe;Yong Man Ro
{"title":"TMT: Tri-Modal Translation Between Speech, Image, and Text by Processing Different Modalities as Different Languages","authors":"Minsu Kim;Jee-weon Jung;Hyeongseop Rha;Soumi Maiti;Siddhant Arora;Xuankai Chang;Shinji Watanabe;Yong Man Ro","doi":"10.1109/TMM.2026.3659297","DOIUrl":"https://doi.org/10.1109/TMM.2026.3659297","url":null,"abstract":"The capability to jointly process multi-modal information is becoming essential. However, the development of multi-modal learning is hindered by the substantial computational requirements and the limited availability of paired multi-modal data. We propose a novel Tri-Modal Translation (TMT) model that translates between arbitrary modalities spanning speech, image, and text. We introduce a simple yet efficient and effective approach, treating speech and image modalities as discrete text modality and approaching multi-modal translation as a well-established machine translation problem. To this end, we tokenize speech and image data into discrete tokens, resulting in a significant reduction in computational cost. Furthermore, by incorporating back translation into multi-modal translation, unpaired data can also be utilized for training. TMT can perform six modality translation tasks and consistently outperforms its single-model counterparts. TMT significantly reduces the required data size (in bits) for training, to approximately 0.2% for speech data and 0.04% for image data, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"1976-1988"},"PeriodicalIF":9.7,"publicationDate":"2026-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147362288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
HMS2Net: Heterogeneous Multimodal State Space Network via CLIP for Dynamic Scene Classification in Livestreaming
IF 9.7 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2026-01-12 | DOI: 10.1109/TMM.2025.3632629
Wensheng Li;Jing Zhang;Li Zhuo;Qi Tian
{"title":"HMS2Net: Heterogeneous Multimodal State Space Network via CLIP for Dynamic Scene Classification in Livestreaming","authors":"Wensheng Li;Jing Zhang;Li Zhuo;Qi Tian","doi":"10.1109/TMM.2025.3632629","DOIUrl":"https://doi.org/10.1109/TMM.2025.3632629","url":null,"abstract":"Livestreaming platforms attract countless daily active users, making online content regulation imperative. The complex and diverse multimodal content elements in dynamic livestreaming scene pose a great challenge to video content understanding. Thanks to the success of contrastive language-image pre-training (CLIP) for dynamic scene classification, which is one of the basic tasks of video content understanding. We propose a heterogeneous multimodal state space network (HMS<sup>2</sup>Net) for dynamic scene classification in livestreaming via CLIP. (1) To fully and efficiently mine the dynamic scene elements in livestreaming, we design a heterogeneous teacher-student Transformer (HT-SFormer) with CLIP to extract multimodal features in an energy-efficient unified pipeline; (2) To cope with the possible information conflicts in heterogeneous feature fusion, we introduce a cross-modal adaptive feature filter and fusion (CMAF) module to generate more complete information complementarity by adjusting multimodal feature composition; (3) For temporal context-awareness of dynamic scene, we establish a dynamic state space memory (DSSM) structure for capturing the correlation of multimodal data between neighboring video frames. A series of comparative experiments are conducted on the publicly available datasets DAVIS, Mini-kinetics, HMDB51, and the self-built BJUT-LCD. Our HMS<sup>2</sup>Net produce competitive results of 71.09%, 95.40%, 53.64%, and 82.36%, respectively, demonstrating the effectiveness and superiority of dynamic scene classification in livestreaming.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"772-785"},"PeriodicalIF":9.7,"publicationDate":"2026-01-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145982358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Dual-Supervised Asymmetric Co-Training for Semi-Supervised Medical Domain Generalization
IF 9.7 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2026-01-01 | Epub: 2025-09-22 | DOI: 10.1109/TMM.2025.3613080
Jincai Song;Haipeng Chen;Jun Qin;Na Zhao
{"title":"Dual-Supervised Asymmetric Co-Training for Semi-Supervised Medical Domain Generalization","authors":"Jincai Song;Haipeng Chen;Jun Qin;Na Zhao","doi":"10.1109/TMM.2025.3613080","DOIUrl":"https://doi.org/10.1109/TMM.2025.3613080","url":null,"abstract":"Semi-supervised domain generalization (SSDG) in medical image segmentation offers a promising solutionfor generalizing to unseen domains during testing, addressing domain shift challenges and minimizing annotation costs. However, conventional SSDG methods assume labeled and unlabeled data are available for each source domain in the training set, a condition that is not always met in practice. The coexistence of limited annotation and domain shift in the training set is a prevalent issue. Thus, this paper explores a more practical and challenging scenario, cross-domain semi-supervised domain generalization (CD-SSDG), where domain shifts occur between labeled and unlabeled training data, in addition to shifts between training and testing sets. Existing SSDG methods exhibit sub-optimal performance under such domain shifts because of inaccurate pseudo-labels. To address this issue, we propose a novel dual-supervised asymmetric co-training (DAC) framework tailored for CD-SSDG. Building upon the co-training paradigm with two sub-models offering cross pseudo supervision, our DAC framework integrates extra feature-level supervision and asymmetric auxiliary tasks for each sub-model. This feature-level supervision serves to address inaccurate pseudo supervision caused by domain shifts between labeled and unlabeled data, utilizing complementary supervision from the rich feature space. Additionally, two distinct auxiliary self-supervised tasks are integrated into each sub-model to enhance domain-invariant discriminative feature learning and prevent model collapse. Extensive experiments on real-world medical image segmentation datasets, i.e., Fundus, Polyp, and SCGM, demonstrate the robust generalizability of the proposed DAC framework.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"2159-2171"},"PeriodicalIF":9.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147557437","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Weakly Semi-Supervised Temporal Sentence Grounding in Videos With Point Annotations
IF 9.7 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2026-01-01 | Epub: 2026-01-06 | DOI: 10.1109/TMM.2026.3651062
Jianxiang Dong;Zhaozheng Yin
{"title":"Weakly Semi-Supervised Temporal Sentence Grounding in Videos With Point Annotations","authors":"Jianxiang Dong;Zhaozheng Yin","doi":"10.1109/TMM.2026.3651062","DOIUrl":"https://doi.org/10.1109/TMM.2026.3651062","url":null,"abstract":"Temporal Sentence Grounding (TSG) in videos aims to localize a temporal interval from an untrimmed video that is semantically relevant to a given query sentence. To achieve a balance between tremendous annotation burden and grounding performance, we propose a new Weakly Semi-supervised Temporal Sentence Grounding with Points (WSS-TSG-P) task, where the dataset comprises limited fully-annotated video-sentence pairs by start and end timestamps (full label) and a large amount of weakly-annotated pairs by a single point timestamp (point label). Based on this setting, we first introduce a point-to-moment<sup>1</sup> regressor which converts point annotations to pseudo moment labels. To train a good regressor for reliable pseudo moment labels, we propose a point-guided feature aggregation module to aggregate cross-modal representations based on the prototype feature at the given point position. In addition, we propose to perform regressor self-training and design pseudo label generation strategies to exploit both full annotations and point annotations. All heterogeneous labels (full, pseudo moment, and point labels) are used to train a TSG backbone. In addition, we propose a novel point-guided group contrastive learning method by constructing reliable positive and negative sets and re-weighting pseudo moment labels to further improve the model performance. Extensive experiments on benchmark datasets verify that our proposed method outperforms other semi-supervised learning methods and bridges the performance gap between weakly-supervised and fully-supervised learning methods in TSG.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"2268-2278"},"PeriodicalIF":9.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147557439","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Hierarchy-Aware Multimodal Distillation for Recommendation
IF 9.7 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2026-01-01 | Epub: 2026-01-12 | DOI: 10.1109/TMM.2026.3651049
Meng Jian;Tuo Wang;Meijuan Yang;Lifang Wu
{"title":"Hierarchy-Aware Multimodal Distillation for Recommendation","authors":"Meng Jian;Tuo Wang;Meijuan Yang;Lifang Wu","doi":"10.1109/TMM.2026.3651049","DOIUrl":"https://doi.org/10.1109/TMM.2026.3651049","url":null,"abstract":"Beyond behavioral interaction records, multimedia recommendation scenarios possess abundant semantic signals, which provide excellent data support for user interest mining. Recently, the multimodal enhanced interaction graph has been actively explored and has achieved great progress. However, these methods overlook the capability disparity of various modalities in learning users' interests and lack the ability to explore the hierarchical relationships of interests in modality, resulting in suboptimal recommendation performance. Therefore, this work investigates intra-modality hierarchical learning and inter-modality guidance, proposing a hyperbolic self-distillation (HSD) model for multimedia recommendation. In each modality space, HSD introduces a hyperbolic propagation to filter users' hierarchical interests from the interaction graph effectively. Inter-modality interests are aligned further by a two-level self-distillation strategy to designate multimodal interactions to teach single-modal learning, aiming at teaching and learning to promote each other. Extensive experiments on four public datasets demonstrate that the proposed HSD outperforms leading baselines for multimedia recommendation, verifying the effectiveness of hierarchical propagation and two-level self-distillation in mining users' hierarchical interests.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"2279-2290"},"PeriodicalIF":9.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147557759","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
TextRSR: Enhanced Arbitrary-Shaped Scene Text Representation via Robust Subspace Recovery
IF 9.7 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2026-01-01 | Epub: 2026-01-06 | DOI: 10.1109/TMM.2026.3651034
Zhiwen Shao;Shengtian Jiang;Hancheng Zhu;Xuehuai Shi;Canlin Li;Lizhuang Ma;Dit-Yan Yeung
{"title":"TextRSR: Enhanced Arbitrary-Shaped Scene Text Representation via Robust Subspace Recovery","authors":"Zhiwen Shao;Shengtian Jiang;Hancheng Zhu;Xuehuai Shi;Canlin Li;Lizhuang Ma;Dit-Yan Yeung","doi":"10.1109/TMM.2026.3651034","DOIUrl":"https://doi.org/10.1109/TMM.2026.3651034","url":null,"abstract":"In recent years, scene text detection research has increasingly focused on arbitrary-shaped texts, where text representation is a fundamental problem. However, most existing methods still struggle to separate adjacent or overlapping texts due to ambiguous spatial positions of points or segmentation masks. Besides, the time efficiency of the entire pipeline is often neglected, resulting in sub-optimal inference speed. To tackle these problems, we first propose a novel text representation method based on robust subspace recovery, which robustly represents complex text shapes by combining orthogonal basis vectors learned from labeled text contours. These basis vectors capture basis contour patterns with distinct information, enabling clearer boundaries even in densely populated text scenarios. Moreover, we propose a dynamic sparse assignment scheme for positive samples that adaptively adjusts their weights during training, which not only accelerates inference speed by eliminating redundant predictions but also enhances feature learning by providing sufficient supervision signals. Building on these innovations, we present TextRSR, an accurate and efficient scene text detection network. Extensive experiments on challenging benchmarks demonstrate the superior accuracy and efficiency of TextRSR compared to state-of-the-art methods. Particularly, TextRSR achieves an F-measure of 88.5% at 37.8 frames per second (FPS) for CTW1500 dataset and an F-measure of 89.1% at 23.1 FPS for Total-Text dataset.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"2550-2563"},"PeriodicalIF":9.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147665450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios
IF 9.7 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2026-01-01 | Epub: 2026-01-12 | DOI: 10.1109/TMM.2026.3651042
Jie Huang;Ruibing Hou;Jiahe Zhao;Hong Chang;Shiguang Shan
{"title":"RefHCM: A Unified Model for Referring Perceptions in Human-Centric Scenarios","authors":"Jie Huang;Ruibing Hou;Jiahe Zhao;Hong Chang;Shiguang Shan","doi":"10.1109/TMM.2026.3651042","DOIUrl":"https://doi.org/10.1109/TMM.2026.3651042","url":null,"abstract":"Human-centric perceptions play a crucial role in real-world applications. While recent human-centric works have achieved impressive progress, these efforts are often constrained to the visual domain and lack interaction with human instructions, limiting their applicability in broader scenarios such as chatbots and sports analysis. This paper introduces <italic>Referring Human Perceptions</i>, where a referring prompt specifies the person of interest in an image. To tackle the new task, we propose RefHCM (<bold>Ref</b>erring <bold>H</b>uman-<bold>C</b>entric <bold>M</b>odel), a unified framework to integrate a wide range of human-centric referring tasks. Specifically, RefHCM employs sequence mergers to convert raw multimodal data—including images, text, coordinates, and parsing maps—into semantic tokens. This standardized representation enables RefHCM to reformulate diverse human-centric referring tasks into a sequence-to-sequence paradigm, solved using a plain encoder-decoder transformer architecture. Benefiting from a unified learning strategy, RefHCM effectively facilitates knowledge transfer across tasks and exhibits unforeseen capabilities in handling complex reasoning. This work represents the first attempt to address referring human perceptions with a general-purpose framework, while simultaneously establishing a corresponding benchmark that sets new standards for the field. Extensive experiments showcase RefHCM’s competitive and even superior performance across multiple human-centric referring tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"2445-2459"},"PeriodicalIF":9.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147665481","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
ASR-Enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval
IF 9.7 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2026-01-01 | Epub: 2026-01-06 | DOI: 10.1109/TMM.2026.3651039
Ruixiang Zhao;Jian Jia;Yan Li;Xuehan Bai;Quan Chen;Han Li;Peng Jiang;Xirong Li
{"title":"ASR-Enhanced Multimodal Representation Learning for Cross-Domain Product Retrieval","authors":"Ruixiang Zhao;Jian Jia;Yan Li;Xuehan Bai;Quan Chen;Han Li;Peng Jiang;Xirong Li","doi":"10.1109/TMM.2026.3651039","DOIUrl":"https://doi.org/10.1109/TMM.2026.3651039","url":null,"abstract":"E-commerce is increasingly <italic>multimedia</i>-enriched, with products exhibited in a broad-domain manner as images, short videos, or live stream promotions. A unified and vectorized cross-domain production representation is essential. Due to large intra-product variance and high inter-product similarity in the broad-domain scenario, a visual-only representation is inadequate. While Automatic Speech Recognition (ASR) text derived from the short or live-stream videos is readily accessible, how to de-noise the excessively noisy text for multimodal representation learning is mostly untouched. We propose <underline>A</u>SR-enhanced <underline>M</u>ultimodal <underline>P</u>roduct R<underline>e</u>p<underline>r</u>esentation L<underline>e</u>arning (<monospace>AMPere</monospace>). In order to extract product-specific information from the raw ASR text, <monospace>AMPere</monospace> uses an easy-to-implement LLM-based ASR text summarizer. The LLM-summarized text, together with visual data, is then fed into a multi-branch network to generate compact multimodal embeddings. Extensive experiments on a large-scale tri-domain dataset verify the effectiveness of <monospace>AMPere</monospace> in obtaining a unified multimodal product representation that clearly improves cross-domain product retrieval.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"2618-2629"},"PeriodicalIF":9.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147665504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Adaptive Use of Convex or Non-Convex Optimization in Deep Unfolding Network for Image Compressive Sensing
IF 9.7 | CAS Q1 | Computer Science
IEEE Transactions on Multimedia | Pub Date: 2026-01-01 | Epub: 2026-01-12 | DOI: 10.1109/TMM.2026.3651069
Chen Liao;Yan Shen;Zhongli Wang;Yanbing Li
{"title":"Adaptive Use of Convex or Non-Convex Optimization in Deep Unfolding Network for Image Compressive Sensing","authors":"Chen Liao;Yan Shen;Zhongli Wang;Yanbing Li","doi":"10.1109/TMM.2026.3651069","DOIUrl":"https://doi.org/10.1109/TMM.2026.3651069","url":null,"abstract":"Recently, deep unfolding networks (DUNs) have emerged as a promising technique for image Compressive Sensing (CS) reconstruction by unfolding optimization algorithms, where each stage of the DUNs corresponds to an iteration of the optimization algorithm. DUNs can be divided into convex optimization based methods and non-convex optimization based methods. On the one hand, DUNs based on convex optimization algorithms cannot handle non-convex optimization problems, thereby limiting their use when the prior term is a non-convex function. On the other hand, although DUNs based on non-convex optimization algorithms can handle more complex prior terms to make global optimal solutions closer to the ground truth, there is a high probability that they converge only to a local optimum. Therefore, in practical applications, it is necessary to consider the various characteristics of the problem comprehensively, then design appropriate prior terms and choose convex or non-convex optimization in DUN. This paper proposes ViP-DUN method to learn suitable prior terms and adaptively use convex or non-convex optimization. ViP-DUN learns deep prior terms and variable metrics in a data-driven manner to achieve adaptive use of convex or non-convex optimization. Moreover, we designed a lightweight multi-scale information fusion module in ViP-DUN at the network structure level to further enhance the network’s processing capability. Experiments demonstrate that our proposed method can improve image reconstruction quality at multiple compression rates through the adaptive capabilities of the network.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"28 ","pages":"2850-2864"},"PeriodicalIF":9.7,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147696624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0