iESTA: Instance-Enhanced Spatial–Temporal Alignment for Video Copy Localization
Xinmiao Ding; Jinming Lou; Wenyang Luo; Yufan Liu; Bing Li; Weiming Hu
IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 5, pp. 4409-4422
DOI: 10.1109/TCSVT.2024.3517664 | Published: 2024-12-16
https://ieeexplore.ieee.org/document/10802955/
Citations: 0
Abstract
Video copy Segment Localization (VSL) requires identifying the temporal segments within a pair of videos that contain copied content. Current methods focus primarily on global temporal modeling and overlook the complementarity of global semantic and local fine-grained features, which limits their effectiveness. Some related methods attempt to incorporate local spatial information but often disrupt spatial semantic structures, resulting in less accurate matching. To address these issues, we propose the Instance-Enhanced Spatial-Temporal Alignment Framework (iESTA), built on a representation granularity that integrates instance-level local features and semantic global features. Specifically, an Instance-relation Graph (IRG) is constructed to capture instance-level features and fine-grained interactions, preserving the integrity of local information and representing the video feature space at an appropriate granularity. An instance-GNN structure is designed to refine these graph representations. For global features, we enhance the semantic representation and capture temporal relationships within videos using a Transformer framework. Additionally, we design a Complementarity-perception Alignment Module (CAM) to process and integrate complementary spatial-temporal information, producing accurate frame-to-frame alignment maps. Our approach also incorporates a differentiable Dynamic Time Warping (DTW) method that uses latent temporal alignments as weak supervisory signals, improving the accuracy of the matching process. Experimental results show that iESTA outperforms state-of-the-art methods on both the small-scale VCDB dataset and the large-scale VCSL dataset.
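The abstract does not spell out how DTW is made differentiable, but a common way to do so is the soft-min relaxation of soft-DTW (Cuturi and Blondel, 2017). The sketch below is a minimal PyTorch illustration of that idea, not the authors' implementation: the function name `soft_dtw`, the `gamma` smoothing parameter, and the cosine-distance cost construction are assumptions for illustration only.

```python
import torch

def soft_dtw(cost: torch.Tensor, gamma: float = 0.1) -> torch.Tensor:
    """Soft-min DTW over an (N, M) pairwise frame-cost matrix.

    `cost` could be, e.g., 1 - cosine similarity between the N frames of
    the query video and the M frames of the reference video. Because the
    hard min is replaced by a smooth soft-min, the returned scalar is
    differentiable; its gradient w.r.t. `cost` is a soft alignment map
    that can serve as a weak supervisory signal.
    """
    n, m = cost.shape
    inf = torch.tensor(float("inf"))
    # R[i][j] accumulates the soft-min cost of aligning the first i
    # query frames with the first j reference frames.
    R = [[inf] * (m + 1) for _ in range(n + 1)]
    R[0][0] = torch.tensor(0.0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            prev = torch.stack([R[i - 1][j - 1], R[i - 1][j], R[i][j - 1]])
            # soft-min(x) = -gamma * log(sum(exp(-x / gamma)))
            R[i][j] = cost[i - 1, j - 1] - gamma * torch.logsumexp(-prev / gamma, dim=0)
    return R[n][m]

# Toy usage: frame features -> cost matrix -> soft alignment map.
q = torch.randn(8, 64)                       # 8 query-frame embeddings
r = torch.randn(10, 64)                      # 10 reference-frame embeddings
cost = 1 - torch.nn.functional.cosine_similarity(
    q.unsqueeze(1), r.unsqueeze(0), dim=-1)  # (8, 10) cosine distances
cost.requires_grad_(True)
soft_dtw(cost).backward()
alignment = cost.grad                        # soft frame-to-frame alignment
```

In iESTA, such a gradient-derived alignment would presumably complement the CAM's frame-to-frame alignment maps; the random embeddings above are merely a stand-in for the paper's learned instance and semantic features.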
Journal Description
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.