Hybrid Siamese Masked Autoencoders as Unsupervised Video Summarizer

Authors: Yifei Xu; Zaiqiang Wu; Li Li; Siqi Li; Wenlong Li; Mingqi Li; Yuan Rao; Shuiguang Deng
DOI: 10.1109/TCSVT.2025.3557254
Journal: IEEE Transactions on Circuits and Systems for Video Technology, vol. 35, no. 9, pp. 9487-9501 (JCR Q1, Engineering, Electrical & Electronic; IF 11.1)
Publication date: 2025-04-02 (Journal Article)
Article page: https://ieeexplore.ieee.org/document/10947580/
Source code: https://github.com/wzq0214/H-SMAE
Citations: 0
Abstract
Video summarization aims to extract the most important information from a source video while retaining its primary content. In practical applications, unsupervised video summarizers are valued for their flexibility, since they require no annotated data. However, they must still establish rules for deciding how essential each frame is before it can be selected for the summary. Unlike conventional frame-based scoring methods, we propose a shot-level unsupervised video summarizer, termed Hybrid Siamese Masked Autoencoders (H-SMAE), that operates from a higher semantic perspective. Specifically, our method consists of Multi-view Siamese Masked Autoencoders (MV-SMAE) and a Shot Diversity Enhancer (SDE). MV-SMAE recovers masked shots from the original frame features and three unmasked shot subsets using elaborate Siamese masked autoencoders. Inspired by the masking idea in MAE, MV-SMAE introduces a Siamese architecture that models prior references to guide the reconstruction of masked shots. In addition, SDE improves the diversity of the generated summary by minimizing a repelling loss among the selected shots. Finally, the outputs of these two modules are fused, and a 0-1 knapsack algorithm produces the video summary. Experiments on two challenging and diverse datasets demonstrate that our approach outperforms other state-of-the-art unsupervised and weakly supervised methods, and even achieves results comparable to several strong supervised methods. The source code of H-SMAE is available at https://github.com/wzq0214/H-SMAE.
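The final step of the pipeline described above, selecting shots with a 0-1 knapsack algorithm, is a standard technique in video summarization: given a per-shot importance score and a per-shot length, choose the subset of shots that maximizes total importance without exceeding a summary-length budget. The sketch below is a generic dynamic-programming knapsack, not the authors' implementation; the function name, the score/length inputs, and the budget value are illustrative assumptions.

```python
def knapsack_select(scores, lengths, budget):
    """0-1 knapsack shot selection (generic sketch, not the H-SMAE code).

    scores  -- per-shot importance scores (e.g. fused MV-SMAE/SDE outputs)
    lengths -- per-shot lengths in frames (integers)
    budget  -- maximum total summary length in frames
    Returns the indices of the selected shots, maximizing total score.
    """
    n = len(scores)
    # dp[c] = best achievable total score with capacity c
    dp = [0.0] * (budget + 1)
    # keep[i][c] records whether shot i is taken at capacity c
    keep = [[False] * (budget + 1) for _ in range(n)]
    for i in range(n):
        # iterate capacities downward so each shot is used at most once
        for c in range(budget, lengths[i] - 1, -1):
            candidate = dp[c - lengths[i]] + scores[i]
            if candidate > dp[c]:
                dp[c] = candidate
                keep[i][c] = True
    # backtrack from full capacity to recover the chosen shot indices
    selected, c = [], budget
    for i in range(n - 1, -1, -1):
        if keep[i][c]:
            selected.append(i)
            c -= lengths[i]
    return sorted(selected)
```

In benchmark practice the budget is often set to a fixed fraction of the video length (commonly around 15%), though the abstract does not state the value used here.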
Journal Introduction:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.