Hybrid Siamese Masked Autoencoders as Unsupervised Video Summarizer

Authors: Yifei Xu; Zaiqiang Wu; Li Li; Siqi Li; Wenlong Li; Mingqi Li; Yuan Rao; Shuiguang Deng
DOI: 10.1109/TCSVT.2025.3557254
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9487-9501 (JCR Q1, Engineering, Electrical & Electronic; IF 11.1)
Publication date: 2025-04-02 (Journal Article)
Article page: https://ieeexplore.ieee.org/document/10947580/
Source code: https://github.com/wzq0214/H-SMAE
Citations: 0
Abstract
Video summarization aims to extract the most important information from a source video while retaining its primary content. In practical applications, unsupervised video summarizers are valued for their flexibility, since they require no annotated data. However, they must still establish rules for deciding how essential each frame is before it can be selected for the summary. Unlike conventional frame-based scoring methods, we propose a shot-level unsupervised video summarizer, termed Hybrid Siamese Masked Autoencoders (H-SMAE), that operates from a higher semantic perspective. Specifically, our method consists of Multi-view Siamese Masked Autoencoders (MV-SMAE) and a Shot Diversity Enhancer (SDE). MV-SMAE recovers masked shots from the original frame features and three unmasked shot subsets using elaborate Siamese masked autoencoders. Inspired by the masking idea in MAE, MV-SMAE introduces a Siamese architecture that models prior references to guide the reconstruction of masked shots. In addition, SDE improves the diversity of the generated summary by minimizing a repelling loss among the selected shots. Finally, the outputs of these two modules are fused, and a 0-1 knapsack algorithm produces the video summary. Experiments on two challenging and diverse datasets demonstrate that our approach outperforms other state-of-the-art unsupervised and weakly supervised methods, and even achieves results comparable to several strong supervised methods. The source code of H-SMAE is available at https://github.com/wzq0214/H-SMAE.
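The final step of the pipeline described above, selecting shots with a 0-1 knapsack algorithm, is a standard technique in video summarization: given a per-shot importance score and a per-shot length, choose the subset of shots that maximizes total importance without exceeding a summary-length budget. The sketch below is a generic dynamic-programming knapsack, not the authors' implementation; the function name, the score/length inputs, and the budget value are illustrative assumptions.

```python
def knapsack_select(scores, lengths, budget):
    """0-1 knapsack shot selection (generic sketch, not the H-SMAE code).

    scores  -- per-shot importance scores (e.g. fused MV-SMAE/SDE outputs)
    lengths -- per-shot lengths in frames (integers)
    budget  -- maximum total summary length in frames
    Returns the indices of the selected shots, maximizing total score.
    """
    n = len(scores)
    # dp[c] = best achievable total score with capacity c
    dp = [0.0] * (budget + 1)
    # keep[i][c] records whether shot i is taken at capacity c
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        # iterate capacities downward so each shot is used at most once
        for c in range(budget, lengths[i] - 1, -1):
            candidate = dp[c - lengths[i]] + scores[i]
            if candidate > dp[c]:
                dp[c] = candidate
                keep[i][c] = True
    # backtrack from full capacity to recover the chosen shot indices
    selected, c = [], budget
    for i in range(n - 1, -1, -1):
        if keep[i][c]:
            selected.append(i)
            c -= lengths[i]
    return sorted(selected)
```

In benchmark practice the budget is often set to a fixed fraction of the video length (commonly around 15%), though the abstract does not state the value used here.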
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.