Dual-branch scale disentanglement for text–video retrieval

IF 3.3 · CAS Tier 3 (Computer Science) · Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Hyunjoon Koo , Jungkyoo Shin , Eunwoo Kim
{"title":"文本视频检索的双分支尺度解纠缠","authors":"Hyunjoon Koo ,&nbsp;Jungkyoo Shin ,&nbsp;Eunwoo Kim","doi":"10.1016/j.patrec.2025.06.014","DOIUrl":null,"url":null,"abstract":"<div><div>In multi-modal understanding, text–video retrieval task, which aims to align videos with the corresponding texts, has gained increasing attention. Previous studies involved aligning fine-grained and coarse-grained features of videos and texts using a single model framework. However, the inherent differences between local and global features may result in entangled representations, leading to sub-optimal results. To address this issue, we introduce an approach to disentangle distinct modality features. Using a dual-branch structure, our method projects local and global features into distinct latent spaces. Each branch employs a different neural network and a loss function, facilitating independent learning of each feature and effectively capturing detailed and comprehensive features. We demonstrate the effectiveness of our method for text–video retrieval task across three different benchmarks, showing improvements over existing methods. It outperforms the compared methods by an average of +1.0%, +0.9%, and +0.6% in R@1 on MSR-VTT, LSMDC and MSVD, respectively</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"196 ","pages":"Pages 296-302"},"PeriodicalIF":3.3000,"publicationDate":"2025-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Dual-branch scale disentanglement for text–video retrieval\",\"authors\":\"Hyunjoon Koo ,&nbsp;Jungkyoo Shin ,&nbsp;Eunwoo Kim\",\"doi\":\"10.1016/j.patrec.2025.06.014\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>In multi-modal understanding, text–video retrieval task, which aims to align videos with the corresponding texts, has gained increasing attention. Previous studies involved aligning fine-grained and coarse-grained features of videos and texts using a single model framework. However, the inherent differences between local and global features may result in entangled representations, leading to sub-optimal results. To address this issue, we introduce an approach to disentangle distinct modality features. Using a dual-branch structure, our method projects local and global features into distinct latent spaces. Each branch employs a different neural network and a loss function, facilitating independent learning of each feature and effectively capturing detailed and comprehensive features. We demonstrate the effectiveness of our method for text–video retrieval task across three different benchmarks, showing improvements over existing methods. 
It outperforms the compared methods by an average of +1.0%, +0.9%, and +0.6% in R@1 on MSR-VTT, LSMDC and MSVD, respectively</div></div>\",\"PeriodicalId\":54638,\"journal\":{\"name\":\"Pattern Recognition Letters\",\"volume\":\"196 \",\"pages\":\"Pages 296-302\"},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2025-07-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Pattern Recognition Letters\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S0167865525002430\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pattern Recognition Letters","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0167865525002430","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

In multi-modal understanding, the text–video retrieval task, which aims to align videos with their corresponding texts, has gained increasing attention. Previous studies aligned fine-grained and coarse-grained features of videos and texts within a single model framework. However, the inherent differences between local and global features may result in entangled representations, leading to sub-optimal results. To address this issue, we introduce an approach to disentangle distinct modality features. Using a dual-branch structure, our method projects local and global features into distinct latent spaces. Each branch employs a different neural network and loss function, facilitating independent learning of each feature and effectively capturing detailed and comprehensive features. We demonstrate the effectiveness of our method for the text–video retrieval task across three different benchmarks, showing improvements over existing methods. It outperforms the compared methods by an average of +1.0%, +0.9%, and +0.6% in R@1 on MSR-VTT, LSMDC, and MSVD, respectively.
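The paper's implementation is not included on this page, but a minimal PyTorch sketch may help make the dual-branch idea concrete: local (token-level) and global (pooled) features are projected into separate latent spaces by two independent branches, and each space is trained with its own contrastive objective. All module names, layer choices, and hyperparameters below are illustrative assumptions, not the authors' architecture.

import torch
import torch.nn as nn
import torch.nn.functional as F


class DualBranchProjector(nn.Module):
    """Projects token features into two separate (disentangled) latent spaces."""

    def __init__(self, feat_dim: int = 512, latent_dim: int = 256):
        super().__init__()
        # Global branch: a small MLP over the pooled (coarse-grained) feature.
        self.global_branch = nn.Sequential(
            nn.Linear(feat_dim, latent_dim),
            nn.GELU(),
            nn.Linear(latent_dim, latent_dim),
        )
        # Local branch: a self-attention layer over the token (fine-grained)
        # features, followed by pooling and a projection. The two branches
        # deliberately use different networks, echoing the dual-branch design.
        self.local_encoder = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=8, batch_first=True
        )
        self.local_proj = nn.Linear(feat_dim, latent_dim)

    def forward(self, tokens: torch.Tensor):
        # tokens: (batch, num_tokens, feat_dim), e.g. frame or word embeddings.
        global_z = self.global_branch(tokens.mean(dim=1))
        local_z = self.local_proj(self.local_encoder(tokens)).mean(dim=1)
        return F.normalize(global_z, dim=-1), F.normalize(local_z, dim=-1)


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.05):
    """Symmetric contrastive loss over a batch of paired embeddings."""
    logits = a @ b.t() / temperature
    labels = torch.arange(a.size(0), device=a.device)
    return 0.5 * (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels))


# Usage: one projector per modality; the global and local spaces each get
# their own loss, so the two scales are learned independently.
video_proj, text_proj = DualBranchProjector(), DualBranchProjector()
video_tokens = torch.randn(8, 12, 512)   # 8 clips, 12 frame features each
text_tokens = torch.randn(8, 20, 512)    # 8 captions, 20 word features each
vg, vl = video_proj(video_tokens)
tg, tl = text_proj(text_tokens)
loss = info_nce(vg, tg) + info_nce(vl, tl)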
Source journal
Pattern Recognition Letters (Engineering & Technology – Computer Science: Artificial Intelligence)
CiteScore: 12.40
Self-citation rate: 5.90%
Annual publications: 287
Review time: 9.1 months
Journal description: Pattern Recognition Letters aims at rapid publication of concise articles of a broad interest in pattern recognition. Subject areas include all the current fields of interest represented by the Technical Committees of the International Association of Pattern Recognition, and other developing themes involving learning and recognition.