Dynamic scale position embedding for cross-modal representation learning
Jungkyoo Shin, Sungmin Kang, Yoonsik Cho, Eunwoo Kim
Neural Networks, Volume 193, Article 108087 (published 2025-09-08)
DOI: 10.1016/j.neunet.2025.108087
Abstract
In this paper, we introduce a novel approach to capture temporal information in videos across multiple scales for cross-modal learning. As videos naturally encapsulate semantic information of diverse durations, existing methods that primarily depend on fine- and coarse-grained contrastive learning may fail to fully capture the inherent semantic information. To bridge this gap, we propose Dynamic Scale Position Embedding (DSPE), a novel approach that enables a single transformer to interpret videos at various temporal scales through dynamic adjustment of temporal position embedding. In contrast to conventional multi-scale methods that aggregate video clips, DSPE maintains the distinct features of each clip, thus preserving semantic integrity and enhancing semantic content comprehension. Based on this, we present an efficient multi-scale temporal encoder designed to adeptly capture temporal information across a broad spectrum from fine to coarse granularity. Comprehensive experiments on four datasets (MSR-VTT, LSMDC, MSVD, and ActivityNet-Captions) and two distinct tasks (text-video retrieval and video captioning) show consistent performance improvements, highlighting the significance of the presented multi-scale approach.
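The abstract describes DSPE only at a high level, so the following PyTorch-style snippet is a minimal sketch of the general idea, not the authors' implementation: a single set of learnable temporal position embeddings is rescaled on the fly to match the temporal granularity of the clips being encoded, letting one transformer serve both fine- and coarse-grained inputs. All names here (DynamicScalePositionEmbedding, base_len, and the interpolation strategy) are illustrative assumptions.

```python
# Hypothetical sketch of dynamically rescaled temporal position embeddings.
# It only illustrates the general idea of adapting one position-embedding
# table to clips of different temporal granularity; it is not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DynamicScalePositionEmbedding(nn.Module):
    def __init__(self, dim: int, base_len: int = 64):
        super().__init__()
        # One learnable table of temporal positions shared across all scales.
        self.base_pe = nn.Parameter(torch.randn(1, base_len, dim) * 0.02)

    def forward(self, frame_feats: torch.Tensor) -> torch.Tensor:
        # frame_feats: (batch, num_frames, dim); num_frames varies per scale.
        num_frames = frame_feats.size(1)
        # Rescale the shared table to the current temporal scale via
        # linear interpolation along the time axis.
        pe = F.interpolate(
            self.base_pe.transpose(1, 2),  # (1, dim, base_len)
            size=num_frames,
            mode="linear",
            align_corners=False,
        ).transpose(1, 2)                  # (1, num_frames, dim)
        return frame_feats + pe


if __name__ == "__main__":
    pe = DynamicScalePositionEmbedding(dim=512, base_len=64)
    fine = torch.randn(2, 64, 512)    # fine-grained input: many short segments
    coarse = torch.randn(2, 8, 512)   # coarse-grained input: few long segments
    print(pe(fine).shape, pe(coarse).shape)  # same module serves both scales
```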
About the Journal
Neural Networks is a platform that aims to foster an international community of scholars and practitioners interested in neural networks, deep learning, and other approaches to artificial intelligence and machine learning. Our journal invites submissions covering various aspects of neural networks research, from computational neuroscience and cognitive modeling to mathematical analyses and engineering applications. By providing a forum for interdisciplinary discussions between biology and technology, we aim to encourage the development of biologically-inspired artificial intelligence.