IEEE Transactions on Multimedia: Latest Articles

Cross-Projection Distilling Knowledge for Omnidirectional Image Quality Assessment
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-28 DOI: 10.1109/TMM.2025.3590920
Huixin Hu;Feng Shao;Hangwei Chen;Xiongli Chai;Qiuping Jiang
{"title":"Cross-Projection Distilling Knowledge for Omnidirectional Image Quality Assessment","authors":"Huixin Hu;Feng Shao;Hangwei Chen;Xiongli Chai;Qiuping Jiang","doi":"10.1109/TMM.2025.3590920","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590920","url":null,"abstract":"Nowadays, virtual reality technology is advancing rapidly and becoming increasingly matured. Omnidirectional images have integrated into the daily lives of many individuals. However, these images are susceptible to irreversible distortion during the encoding and transmission processes. Given the unique characteristics of deformation and distortion in omnidirectional images, the development of a quality assessment method is crucial. To ensure that our network not only delivers efficient and stable performance but also maintains a minimal parameter count, we have integrated the concept of knowledge distillation into our network. This involves utilizing a full-reference (FR) teacher network to guide the training of a no-reference (NR) student network by cross-projection distilling knowledge. To specifically implement this method, a Dual Projection Format Fusion (DPFF) module is specifically designed to complement and integrate the mutual fusion of the two projection formats of omnidirectional images. In the design of our knowledge distillation process and loss function, we have introduced a review mechanism to enhance the performance and efficiency of response-based knowledge, as well as utilized intermediate fusion features to improve the effectiveness of feature-based knowledge. These components are combined to formulate the final loss function. Experimental results validate the superiority of our proposed model over existing FR and NR methods when evaluated on four omnidirectional image databases. This highlights the effectiveness of our proposed model in elevating the quality assessment of omnidirectional images.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6752-6765"},"PeriodicalIF":9.7,"publicationDate":"2025-07-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
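The combined objective described above, a supervised regression term plus response-based and feature-based distillation from the FR teacher, can be illustrated with a minimal PyTorch sketch. The loss weights, tensor shapes, and the omission of the paper's review mechanism are assumptions made purely for illustration, not the authors' implementation.

```python
import torch.nn.functional as F

def distillation_loss(student_score, teacher_score,
                      student_feat, teacher_feat,
                      mos, alpha=1.0, beta=0.1):
    """Hypothetical combination of task, response, and feature terms.

    student_score / teacher_score: (B, 1) predicted quality scores.
    student_feat / teacher_feat:   (B, C, H, W) intermediate fusion features.
    mos: (B, 1) ground-truth mean opinion scores.
    """
    # Supervised regression loss for the NR student.
    task = F.l1_loss(student_score, mos)
    # Response-based knowledge: match the FR teacher's predicted score.
    response = F.mse_loss(student_score, teacher_score.detach())
    # Feature-based knowledge: match intermediate fusion features.
    feature = F.mse_loss(student_feat, teacher_feat.detach())
    return task + alpha * response + beta * feature
```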
Multi-Grained Vision-and-Language Model for Medical Image and Text Alignment
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-23 DOI: 10.1109/TMM.2025.3590930
Huimin Yan;Xian Yang;Liang Bai;Jiamin Li;Jiye Liang
{"title":"Multi-Grained Vision-and-Language Model for Medical Image and Text Alignment","authors":"Huimin Yan;Xian Yang;Liang Bai;Jiamin Li;Jiye Liang","doi":"10.1109/TMM.2025.3590930","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590930","url":null,"abstract":"The increasing interest in learning from paired medical images and textual reports highlights the need for methods that can achieve multi-grained alignment between these two modalities. However, most existing approaches overlook fine-grained semantic alignment, which can constrain the quality of the generated representations. To tackle this problem, we propose the Multi-Grained Vision-and-Language Alignment (MGVLA) model, which effectively leverages multi-grained correspondences between medical images and texts at different levels, including disease, instance, and token levels. For disease-level alignment, our approach adopts the concept of contrastive learning and uses medical terminologies detected from textual reports as soft labels to guide the alignment process. At the instance level, we propose a strategy for sampling hard negatives, where images and texts with the same disease type but differing in details such as disease locations and severity are considered as hard negatives. This strategy helps our approach to better distinguish between positive and negative image-text pairs, ultimately enhancing the quality of our learned representations. For token-level alignment, we employ a masking and recovery technique to achieve fine-grained semantic alignment between patches and sub-words. This approach effectively aligns the different levels of granularity between the image and language modalities. To assess the efficacy of our MGVLA model, we conduct comprehensive experiments on the image-text retrieval and phrase grounding tasks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6780-6792"},"PeriodicalIF":9.7,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141601","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
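The disease-level alignment step, contrastive learning guided by soft labels derived from detected medical terminology, might look roughly like the sketch below. The multi-hot label encoding, temperature value, and target normalisation are assumptions; the paper's instance- and token-level objectives are not shown.

```python
import torch
import torch.nn.functional as F

def soft_label_contrastive(img_emb, txt_emb, disease_labels, tau=0.07):
    """Disease-level alignment with soft targets from label overlap (a sketch).

    img_emb, txt_emb: (B, D) image / report embeddings.
    disease_labels:   (B, K) multi-hot medical-terminology labels.
    """
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / tau                       # (B, B) similarities
    # Soft targets: pairs sharing more disease terms get higher weight.
    overlap = disease_labels.float() @ disease_labels.float().t()
    targets = overlap / overlap.sum(dim=1, keepdim=True).clamp(min=1e-8)
    log_probs = F.log_softmax(logits, dim=1)
    return -(targets * log_probs).sum(dim=1).mean()
```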
XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-23 DOI: 10.1109/TMM.2025.3590912
Sida Tian;Can Zhang;Wei Yuan;Wei Tan;Wenjie Zhu
{"title":"XMusic: Towards a Generalized and Controllable Symbolic Music Generation Framework","authors":"Sida Tian;Can Zhang;Wei Yuan;Wei Tan;Wenjie Zhu","doi":"10.1109/TMM.2025.3590912","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590912","url":null,"abstract":"In recent years, remarkable advancements in artificial intelligence-generated content (AIGC) have been achieved in the fields of image synthesis and text generation, generating content comparable to that produced by humans. However, the quality of AI-generated music has not yet reached this standard, primarily due to the challenge of effectively controlling musical emotions and ensuring high-quality outputs. This paper presents a generalized symbolic music generation framework, XMusic, which supports flexible prompts (i.e., images, videos, texts, tags, and humming) to generate emotionally controllable and high-quality symbolic music. XMusic consists of two core components, XProjector and XComposer. XProjector parses the prompts of various modalities into symbolic music elements (i.e., emotions, genres, rhythms and notes) within the projection space to generate matching music. XComposer contains a Generator and a Selector. The Generator generates emotionally controllable and melodious music based on our innovative symbolic music representation, whereas the Selector identifies high-quality symbolic music by constructing a multi-task learning scheme involving quality assessment, emotion recognition, and genre recognition tasks. In addition, we build XMIDI, a large-scale symbolic music dataset that contains 108,023 MIDI files annotated with precise emotion and genre labels. Objective and subjective evaluations show that XMusic significantly outperforms the current state-of-the-art methods with impressive music quality. Our XMusic has been awarded as one of the nine <italic>Highlights of Collectibles at WAIC 2023</i>. The project homepage of XMusic is: <uri>https://xmusic-project.github.io</uri>.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6857-6871"},"PeriodicalIF":9.7,"publicationDate":"2025-07-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
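The Selector's multi-task design, scoring generated candidates for quality, emotion, and genre and then keeping the best match, could be organised roughly as below. The backbone, head sizes, and the scoring rule used to pick a candidate are assumptions for illustration, not the authors' implementation.

```python
import torch
import torch.nn as nn

class Selector(nn.Module):
    """Hypothetical multi-task scorer over symbolic-music candidates."""
    def __init__(self, dim=512, n_emotions=8, n_genres=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(dim, dim), nn.ReLU())
        self.quality_head = nn.Linear(dim, 1)            # quality assessment
        self.emotion_head = nn.Linear(dim, n_emotions)   # emotion recognition
        self.genre_head = nn.Linear(dim, n_genres)       # genre recognition

    def forward(self, candidate_feats):                  # (N, dim) encodings
        h = self.backbone(candidate_feats)
        return (self.quality_head(h).squeeze(-1),
                self.emotion_head(h), self.genre_head(h))

def pick_best(selector, feats, target_emotion, target_genre):
    quality, emo_logits, genre_logits = selector(feats)
    # Prefer candidates that are high quality and match the requested tags.
    score = (quality
             + emo_logits.log_softmax(-1)[:, target_emotion]
             + genre_logits.log_softmax(-1)[:, target_genre])
    return int(score.argmax())
```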
AFAN: An Attention-Driven Forgery Adversarial Network for Blind Image Inpainting
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-22 DOI: 10.1109/TMM.2025.3590914
Jiahao Wang;Gang Pan;Di Sun;Jinyuan Li;Jiawan Zhang
{"title":"AFAN: An Attention-Driven Forgery Adversarial Network for Blind Image Inpainting","authors":"Jiahao Wang;Gang Pan;Di Sun;Jinyuan Li;Jiawan Zhang","doi":"10.1109/TMM.2025.3590914","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590914","url":null,"abstract":"Blind image inpainting is a challenging task aimed at reconstructing corrupted regions without relying on mask information. Due to the lack of mask priors, previous methods usually integrate a mask prediction network in the initial phase, followed by an inpainting backbone. However, this multi-stage generation process may result in feature misalignment. While recent end-to-end generative methods bypass the mask prediction step, they typically struggle with weak perception of contaminated regions and introduce structural distortions. This study presents a novel mask region perception strategy for blind image inpainting by combining adversarial training with forgery detection. To implement this strategy, we propose an attention-driven forgery adversarial network (AFAN), which leverages adaptive contextual attention (ACA) blocks for effective feature modulation. Specifically, within the generator, ACA employs self-attention to enhance content reconstruction by utilizing the rich contextual information of adjacent tokens. In the discriminator, ACA utilizes cross-attention with noise priors to guide adversarial learning for forgery detection. Moreover, we design a high-frequency omni-dimensional dynamic convolution (HODC) based on edge feature enhancement to improve detail representation. Extensive evaluations across multiple datasets demonstrate that the proposed AFAN model outperforms existing generative methods in blind image inpainting, particularly in terms of quality and texture fidelity.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6845-6856"},"PeriodicalIF":9.7,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141604","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
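A rough sketch of an adaptive contextual attention block that acts as self-attention over image tokens in the generator and as cross-attention against noise-prior tokens in the discriminator. The wiring, normalisation, and feed-forward details are assumptions rather than the paper's exact ACA design.

```python
import torch
import torch.nn as nn

class ACABlock(nn.Module):
    """Hypothetical adaptive contextual attention block (a sketch)."""
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                 nn.Linear(4 * dim, dim))

    def forward(self, tokens, context=None):
        # Generator use: context is None -> self-attention over image tokens.
        # Discriminator use: context holds noise-prior tokens -> cross-attention.
        kv = tokens if context is None else context
        attended, _ = self.attn(tokens, kv, kv)
        tokens = self.norm(tokens + attended)
        return tokens + self.ffn(tokens)
```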
ROSA: A Robust Self-Adaptive Model for Multimodal Emotion Recognition With Uncertain Missing Modalities
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-22 DOI: 10.1109/TMM.2025.3590929
Ziming Li;Yaxin Liu;Chuanpeng Yang;Yan Zhou;Songlin Hu
{"title":"ROSA: A Robust Self-Adaptive Model for Multimodal Emotion Recognition With Uncertain Missing Modalities","authors":"Ziming Li;Yaxin Liu;Chuanpeng Yang;Yan Zhou;Songlin Hu","doi":"10.1109/TMM.2025.3590929","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590929","url":null,"abstract":"The rapid development of online media has heightened the importance of multimodal emotion recognition (MER) in video analysis. However, practical applications often encounter challenges due to missing modalities caused by various interferences. It is difficult to predict the specific missing situations, such as the number and types of missing modalities. Current approaches to modality missing typically apply a uniform method to address various missing cases, which are insufficiently adaptive to dynamic conditions. For example, translation-based methods can efficiently complete missing text from audio, but generating audio or video features that retain the original emotional information from other modalities is challenging and may introduce additional noise. In this paper, we introduce ROSA, a novel <bold>ro</b>bust <bold>s</b>elf-<bold>a</b>daptive model designed to address various missing cases with tailored approaches, leveraging available modalities effectively and reducing the introduction of additional noise. Specifically, the A-T Completion module based on the encoder-decoder architecture enables ROSA to generate missing raw text from audio rather than mere embedding representations, capturing more nuanced modal features. Additionally, we design the T-V Fusion module based on a vision-language large model for deep extraction and fusion of textual and visual features. Comprehensive experiments conducted on three widely used public datasets demonstrate the superiority and effectiveness of our model. ROSA outperforms other models in both fixed missing rate and fixed missing modality cases. The ablation studies further highlights the contribution of each designed module.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6766-6779"},"PeriodicalIF":9.7,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141607","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
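The self-adaptive idea, handling each missing-modality case with a tailored path (for example, regenerating raw text from audio via A-T Completion before text-visual fusion), can be summarised in a small dispatch sketch. The function names and interfaces here are hypothetical placeholders, not the paper's API.

```python
from typing import Callable, Optional

def rosa_forward(audio: Optional[object],
                 text: Optional[str],
                 video: Optional[object],
                 at_completion: Callable,   # audio -> raw text
                 tv_fusion: Callable,       # (text, video) -> fused features
                 classify: Callable):       # (fused, audio) -> emotion logits
    """Hypothetical dispatch: handle each missing-modality case differently."""
    if text is None and audio is not None:
        # A-T Completion: regenerate raw text from audio instead of
        # synthesising audio/video features, which tends to add noise.
        text = at_completion(audio)
    fused = tv_fusion(text, video)          # deep text-visual fusion (T-V Fusion)
    return classify(fused, audio)
```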
Image Super-Resolution With Taylor Expansion Approximation and Large Field Reception
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-21 DOI: 10.1109/TMM.2025.3590917
Jiancong Feng;Yuan-Gen Wang;Mingjie Li;Fengchuang Xing
{"title":"Image Super-Resolution With Taylor Expansion Approximation and Large Field Reception","authors":"Jiancong Feng;Yuan-Gen Wang;Mingjie Li;Fengchuang Xing","doi":"10.1109/TMM.2025.3590917","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590917","url":null,"abstract":"Self-similarity techniques are booming in no-reference super-resolution (SR) due to accurate estimation of the degradation types involved in low-resolution images. However, high-dimensional matrix multiplication within self-similarity computation prohibitively consumes massive computational costs. We find that the high-dimensional attention map is derived from the matrix multiplication between query and key, followed by a softmax function. This softmax makes the matrix multiplication inseparable, posing a great challenge in simplifying computational complexity. To address this issue, we first propose a second-order Taylor expansion approximation (STEA) to separate the matrix multiplication of query and key, resulting in the complexity reduction from <inline-formula><tex-math>$mathcal {O}(N^{2})$</tex-math></inline-formula> to <inline-formula><tex-math>$mathcal {O}(N)$</tex-math></inline-formula>. Then, we design a multi-scale large field reception (MLFR) to compensate for the performance degradation caused by STEA. Finally, we apply these two core designs to laboratory and real-world scenarios by constructing LabNet and RealNet, respectively. Extensive experimental results tested on five synthetic datasets demonstrate that our LabNet sets a new benchmark in qualitative and quantitative evaluations. Tested on the real-world dataset, our RealNet achieves superior visual quality over existing methods. Ablation studies further verify the contributions of STEA and MLFR towards both LabNet and RealNet frameworks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6819-6830"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
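The complexity claim rests on replacing the softmax's exponential with its second-order Taylor expansion, exp(q·k) ≈ 1 + q·k + (q·k)²/2, which factorises through the feature map φ(x) = [1, x, vec(xxᵀ)/√2] so that keys and values can be aggregated once before being combined with the queries. The NumPy sketch below shows this generic linearisation in O(N·d²); it omits scaling and the paper's MLFR compensation, so treat it as an approximation of the idea rather than the authors' STEA.

```python
import numpy as np

def taylor_feature_map(x):
    """phi(x) = [1, x, vec(x x^T)/sqrt(2)], so that
    phi(q) . phi(k) = 1 + q.k + (q.k)^2 / 2  (2nd-order Taylor of exp)."""
    n, d = x.shape
    ones = np.ones((n, 1))
    outer = np.einsum('ni,nj->nij', x, x).reshape(n, d * d) / np.sqrt(2.0)
    return np.concatenate([ones, x, outer], axis=1)

def taylor_linear_attention(q, k, v):
    """Approximate softmax attention in O(N d^2) instead of O(N^2 d)."""
    fq, fk = taylor_feature_map(q), taylor_feature_map(k)   # (N, 1+d+d^2)
    kv = fk.T @ v                      # aggregate keys/values once
    z = fq @ fk.sum(axis=0)            # per-query normalisation term
    return (fq @ kv) / z[:, None]
```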
Towards Student Actions in Classroom Scenes: New Dataset and Baseline
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-21 DOI: 10.1109/TMM.2025.3590899
Zhuolin Tan;Chenqiang Gao;Anyong Qin;Ruixin Chen;Tiecheng Song;Feng Yang;Deyu Meng
{"title":"Towards Student Actions in Classroom Scenes: New Dataset and Baseline","authors":"Zhuolin Tan;Chenqiang Gao;Anyong Qin;Ruixin Chen;Tiecheng Song;Feng Yang;Deyu Meng","doi":"10.1109/TMM.2025.3590899","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590899","url":null,"abstract":"Analyzing student actions is an important and challenging task in educational research. Existing efforts have been hampered by the lack of accessible datasets to capture the nuanced action dynamics in classrooms. In this paper, we present a new multi-label <italic>Student Action Video</i> (SAV) dataset, specifically designed for action detection in classroom settings. The SAV dataset consists of 4,324 carefully trimmed video clips from 758 different classrooms, annotated with 15 distinct student actions. Compared to existing action detection datasets, the SAV dataset stands out by providing a wide range of real classroom scenarios, high-quality video data, and unique challenges, including subtle movement differences, dense object engagement, significant scale differences, varied shooting angles, and visual occlusion. These complexities introduce new opportunities and challenges to advance action detection methods. To benchmark this, we propose a novel baseline method based on a visual transformer, designed to enhance attention to key local details within small and dense object regions. Our method demonstrates excellent performance with a mean Average Precision (mAP) of 67.9% and 27.4% on the SAV and AVA datasets, respectively. This paper not only provides the dataset but also calls for further research into AI-driven educational tools that may transform teaching methodologies and learning outcomes.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6831-6844"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141670","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
Single-Domain Generalized Object Detection With Frequency Whitening and Contrastive Learning
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-21 DOI: 10.1109/TMM.2025.3590915
Xiaolong Guo;Chengxu Liu;Xueming Qian;Zhixiao Wang;Xubin Feng;Yao Xue
{"title":"Single-Domain Generalized Object Detection With Frequency Whitening and Contrastive Learning","authors":"Xiaolong Guo;Chengxu Liu;Xueming Qian;Zhixiao Wang;Xubin Feng;Yao Xue","doi":"10.1109/TMM.2025.3590915","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590915","url":null,"abstract":"Single-Domain Generalization Object Detection (Single-DGOD) refers to training a model with only one source domain, enabling the model to generalize to any unseen domain. For instance, a detector trained on a sunny daytime dataset should also perform well in scenarios such as rainy nighttime. The main challenge is to improve the detector’s ability to learn the domain-invariant representation (DIR) while removing domain-specific information. Recent progress in Single-DGOD has demonstrated the efficacy of removing domain-specific information by adjusting feature distributions. Nonetheless, simply adjusting the global feature distribution in Single-DGOD task is insufficient to learn the potential relationship from sunny to adverse weather, as these ignore the significant domain gaps between instances across different weathers. In this paper, we propose a novel object detection method for more robust single-domain generalization. In particular, it mainly consists of a frequency-aware selective whitening module (FSW) for removing redundant domain-specific information and a contrastive feature alignment module (CFA) for enhancing domain-invariant information among instances. Specially, FSW extracts the magnitude spectrum of the feature and uses a group whitening loss to selectively eliminate redundant domain-specific information in the magnitude. To further eliminate domain differences among instances, we apply the style transfer method for data augmentation and use the augmented data in the CFA module. CFA formulates both the original and the augmentd RoI features into a series of groups with different categories, and utilizes contrastive learning across them to facilitate the learning of DIR in various categories. Experiments show that our method achieves favorable performance on existing standard benchmarks.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6805-6818"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
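A sketch of the frequency-side step: take the magnitude spectrum of a feature map and penalise covariance among channel pairs flagged as domain-specific. How those pairs are selected and grouped is the paper's contribution and is left as an input mask here; shapes and normalisation are assumptions.

```python
import torch

def magnitude_spectrum(feat):
    """Amplitude of the 2-D FFT of a feature map (B, C, H, W)."""
    return torch.fft.fft2(feat, norm='ortho').abs()

def selective_whitening_loss(feat, pair_mask):
    """Suppress correlations among selected channel pairs (a sketch).

    feat:      (B, C, H, W) backbone features.
    pair_mask: (C, C) 0/1 selection of channel pairs to whiten
               (the selective/group criterion is not reproduced here).
    """
    amp = magnitude_spectrum(feat)
    b, c, h, w = amp.shape
    x = amp.reshape(b, c, h * w)
    x = x - x.mean(dim=-1, keepdim=True)
    cov = x @ x.transpose(1, 2) / (h * w - 1)          # (B, C, C)
    off_diag = cov * (1 - torch.eye(c, device=feat.device))
    return (off_diag * pair_mask).abs().mean()
```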
Boosting Modal-Specific Representations for Sentiment Analysis With Incomplete Modalities
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-21 DOI: 10.1109/TMM.2025.3590909
Xin Jiang;Lihuo He;Fei Gao;Kaifan Zhang;Jie Li;Xinbo Gao
{"title":"Boosting Modal-Specific Representations for Sentiment Analysis With Incomplete Modalities","authors":"Xin Jiang;Lihuo He;Fei Gao;Kaifan Zhang;Jie Li;Xinbo Gao","doi":"10.1109/TMM.2025.3590909","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590909","url":null,"abstract":"Multimodal sentiment analysis aims at exploiting complementary information from multiple modalities or data sources to enhance the understanding and interpretation of sentiment. While existing multi-modal fusion techniques offer significant improvements in sentiment analysis, real-world scenarios often involve missing modalities, introducing complexity due to uncertainty of which modalities may be absent. To tackle the challenge of incomplete modality-specific feature extraction caused by missing modalities, this paper proposes a Cosine Margin-Aware Network (CMANet) which centers on the Cosine Margin-Aware Distillation (CMAD) module. The core module measures distance between samples and the classification boundary, enabling CMANet to focus on samples near the boundary. So, it effectively captures the unique features of different modal combinations. To address the issue of modality imbalance during modality-specific feature extraction, this paper proposes a Weak Modality Regularization (WMR) strategy, which aligns the feature distributions between strong and weak modalities at the dataset-level, while also enhancing the prediction loss of samples at the sample-level. This dual mechanism improves the recognition robustness of weak modality combination. Extensive experiments demonstrate that the proposed method outperforms the previous best model, MMIN, with a 3.82% improvement in unweighted accuracy. These results underscore the robustness of the approach under conditions of uncertain and missing modalities.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6793-6804"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145210117","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
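To illustrate the "distance to the classification boundary" idea behind CMAD, the sketch below uses a standard additive cosine-margin (CosFace-style) logit formulation, which makes samples near the boundary incur larger loss. It is a stand-in for the general mechanism, not the paper's CMAD module or its distillation weighting; the margin and scale values are assumptions.

```python
import torch
import torch.nn.functional as F

def cosine_margin_logits(features, class_weights, labels, m=0.35, s=30.0):
    """Cosine-similarity logits with an additive margin on the target class.

    features:      (B, D) sample embeddings.
    class_weights: (K, D) class prototype vectors.
    labels:        (B,)   integer class labels.
    """
    cos = F.normalize(features) @ F.normalize(class_weights).t()   # (B, K)
    margin = torch.zeros_like(cos).scatter_(1, labels.unsqueeze(1), m)
    return s * (cos - margin)

# Usage: loss = F.cross_entropy(cosine_margin_logits(feat, W, y), y)
```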
Towards Efficient Partially Relevant Video Retrieval With Active Moment Discovering
IF 9.7 | Q1 | Computer Science
IEEE Transactions on Multimedia Pub Date : 2025-07-21 DOI: 10.1109/TMM.2025.3590937
Peipei Song;Long Zhang;Long Lan;Weidong Chen;Dan Guo;Xun Yang;Meng Wang
{"title":"Towards Efficient Partially Relevant Video Retrieval With Active Moment Discovering","authors":"Peipei Song;Long Zhang;Long Lan;Weidong Chen;Dan Guo;Xun Yang;Meng Wang","doi":"10.1109/TMM.2025.3590937","DOIUrl":"https://doi.org/10.1109/TMM.2025.3590937","url":null,"abstract":"Partially relevant video retrieval (PRVR) is a practical yet challenging task in text-to-video retrieval, where videos are untrimmed and contain much background content. The pursuit here is of both effective and efficient solutions to capture the partial correspondence between text queries and untrimmed videos. Existing PRVR methods, which typically focus on modeling multi-scale clip representations, however, suffer from content independence and information redundancy, impairing retrieval performance. To overcome these limitations, we propose a simple yet effective approach with active moment discovering (AMDNet). We are committed to discovering video moments that are semantically consistent with their queries. By using learnable span anchors to capture distinct moments and applying masked multi-moment attention to emphasize salient moments while suppressing redundant backgrounds, we achieve more compact and informative video representations. To further enhance moment modeling, we introduce a moment diversity loss to encourage different moments of distinct regions and a moment relevance loss to promote semantically query-relevant moments, which cooperate with a partially relevant retrieval loss for end-to-end optimization. Extensive experiments on two large-scale video datasets (<italic>i.e</i>., TVR and ActivityNet Captions) demonstrate the superiority and efficiency of our AMDNet. In particular, AMDNet is about 15.5 times smaller (#parameters) while 6.0 points higher (SumR) than the up-to-date method GMMFormer on TVR.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"6740-6751"},"PeriodicalIF":9.7,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
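A sketch of the moment-discovery idea: learnable span anchors (center, width) become soft temporal masks that pool frame features into a few moment representations, on which masked attention and the diversity/relevance losses would then operate. The Gaussian mask parameterisation and module shapes are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn

class MomentPooling(nn.Module):
    """Hypothetical span-anchor pooling of frame features into moments."""
    def __init__(self, n_moments=4, dim=512):
        super().__init__()
        # Each anchor = (center, width), both mapped to (0, 1) by a sigmoid.
        self.anchors = nn.Parameter(torch.rand(n_moments, 2))
        self.proj = nn.Linear(dim, dim)

    def forward(self, frame_feats):                   # (B, T, dim)
        b, t, _ = frame_feats.shape
        pos = torch.linspace(0, 1, t, device=frame_feats.device)      # (T,)
        center, width = torch.sigmoid(self.anchors).unbind(dim=-1)    # (M,), (M,)
        # Gaussian-shaped soft mask per moment: emphasise frames inside the span.
        mask = torch.exp(-((pos[None, :] - center[:, None]) ** 2)
                         / (2 * (width[:, None] / 2) ** 2 + 1e-6))    # (M, T)
        mask = mask / mask.sum(dim=-1, keepdim=True)
        moments = torch.einsum('mt,btd->bmd', mask, frame_feats)      # (B, M, dim)
        return self.proj(moments)
```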