Dynamic scale position embedding for cross-modal representation learning

IF 6.3 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Jungkyoo Shin, Sungmin Kang, Yoonsik Cho, Eunwoo Kim
{"title":"Dynamic scale position embedding for cross-modal representation learning","authors":"Jungkyoo Shin,&nbsp;Sungmin Kang,&nbsp;Yoonsik Cho,&nbsp;Eunwoo Kim","doi":"10.1016/j.neunet.2025.108087","DOIUrl":null,"url":null,"abstract":"<div><div>In this paper, we introduce a novel approach to capture temporal information in videos across multiple scales for cross-modal learning. As videos naturally encapsulate semantic information of diverse durations, existing methods that primarily depend on fine- and coarse-grained contrastive learning may fail to fully capture the inherent semantic information. To bridge this gap, we propose Dynamic Scale Position Embedding (DSPE), a novel approach that enables a single transformer to interpret videos at various temporal scales through dynamic adjustment of temporal position embedding. In contrast to conventional multi-scale methods that aggregate video clips, DSPE maintains the distinct features of each clip, thus preserving semantic integrity and enhancing semantic content comprehension. Based on this, we present an efficient multi-scale temporal encoder designed to adeptly capture temporal information across a broad spectrum from fine to coarse granularity. Comprehensive experiments across four datasets–MSR-VTT, LSMDC, MSVD, and ActivityNet-Captions–and two distinct tasks–text-video retrieval and video-captioning–with consistent performance improvements highlight the significance of the presented multi-scale approach.</div></div>","PeriodicalId":49763,"journal":{"name":"Neural Networks","volume":"193 ","pages":"Article 108087"},"PeriodicalIF":6.3000,"publicationDate":"2025-09-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Neural Networks","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0893608025009670","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

In this paper, we introduce a novel approach to capture temporal information in videos across multiple scales for cross-modal learning. As videos naturally encapsulate semantic information of diverse durations, existing methods that primarily depend on fine- and coarse-grained contrastive learning may fail to fully capture the inherent semantic information. To bridge this gap, we propose Dynamic Scale Position Embedding (DSPE), a novel approach that enables a single transformer to interpret videos at various temporal scales through dynamic adjustment of temporal position embedding. In contrast to conventional multi-scale methods that aggregate video clips, DSPE maintains the distinct features of each clip, thus preserving semantic integrity and enhancing semantic content comprehension. Based on this, we present an efficient multi-scale temporal encoder designed to adeptly capture temporal information across a broad spectrum from fine to coarse granularity. Comprehensive experiments across four datasets (MSR-VTT, LSMDC, MSVD, and ActivityNet-Captions) and two distinct tasks (text-video retrieval and video captioning), with consistent performance improvements, highlight the significance of the presented multi-scale approach.
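To make the core idea more concrete, below is a minimal, hypothetical PyTorch sketch of how a dynamic scale position embedding and a shared multi-scale temporal encoder might be wired together. The class names, the floor-division remapping of clip indices to coarser positions, and the encoder configuration are illustrative assumptions, not the paper's implementation; the sketch only reflects the abstract's description that position embeddings are adjusted per scale while individual clip features are never aggregated.

```python
# Hypothetical sketch of the dynamic-scale position-embedding idea.
# All names and the index-remapping rule are assumptions for illustration.
import torch
import torch.nn as nn


class DynamicScalePositionEmbedding(nn.Module):
    """Shares one learnable temporal embedding table across scales by
    remapping clip indices; clip features themselves are left untouched."""

    def __init__(self, max_clips: int = 64, dim: int = 512):
        super().__init__()
        self.pos_table = nn.Embedding(max_clips, dim)

    def forward(self, clip_feats: torch.Tensor, scale: int) -> torch.Tensor:
        # clip_feats: (batch, num_clips, dim); scale: temporal granularity
        # (1 = fine, larger = coarser). Clips falling into the same coarse
        # segment reuse one position embedding but keep their own features.
        _, num_clips, _ = clip_feats.shape
        pos_ids = torch.arange(num_clips, device=clip_feats.device) // scale
        return clip_feats + self.pos_table(pos_ids).unsqueeze(0)


class MultiScaleTemporalEncoder(nn.Module):
    """A single transformer encoder applied at several temporal scales."""

    def __init__(self, dim: int = 512, scales=(1, 2, 4)):
        super().__init__()
        self.scales = scales
        self.dspe = DynamicScalePositionEmbedding(dim=dim)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, clip_feats: torch.Tensor):
        # Returns one pooled video representation per temporal scale.
        outputs = []
        for s in self.scales:
            x = self.dspe(clip_feats, scale=s)
            outputs.append(self.encoder(x).mean(dim=1))
        return outputs


if __name__ == "__main__":
    feats = torch.randn(2, 12, 512)          # 2 videos, 12 clip features each
    model = MultiScaleTemporalEncoder()
    for s, v in zip(model.scales, model(feats)):
        print(f"scale {s}: video embedding shape {tuple(v.shape)}")
```

In this reading, the same encoder weights are reused at every scale; only the position indices change, which matches the abstract's claim that a single transformer interprets the video at multiple temporal granularities without pooling clips.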


Source Journal
Neural Networks (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 13.90
Self-citation rate: 7.70%
Annual publications: 425
Review time: 67 days
Journal description: Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.