Improving Video Summarization by Exploring the Coherence Between Corresponding Captions

IF 13.7

IEEE Transactions on Image Processing: a publication of the IEEE Signal Processing Society | Pub Date: 2025-08-20 | DOI: 10.1109/TIP.2025.3598709

Cheng Ye;Weidong Chen;Bo Hu;Lei Zhang;Yongdong Zhang;Zhendong Mao

Volume 34, pages 5369-5384 | Journal Article | https://ieeexplore.ieee.org/document/11130654/

Citations: 0

Abstract


Improving Video Summarization by Exploring the Coherence Between Corresponding Captions

Video summarization aims to generate a compact summary of the original video by selecting and combining the most representative parts. Most existing approaches only focus on recognizing key video segments to generate the summary, which lacks holistic considerations. The transitions between selected video segments are usually abrupt and inconsistent, making the summary confusing. Indeed, the coherence of video summaries is crucial to improve the quality and user viewing experience. However, the coherence between video segments is hard to measure and optimize from a pure vision perspective. To this end, we propose a Language-guided Segment Coherence-Aware Network (LS-CAN), which integrates entire coherence considerations into the key segment recognition. The main idea of LS-CAN is to explore the coherence of corresponding text modality to facilitate the entire coherence of the video summary, which leverages the natural property in the language that contextual coherence is easy to measure. In terms of text coherence measures, specifically, we propose the multi-graph correlated neural network module (MGCNN), which constructs a graph for each sentence based on three key components, i.e., subject, attribute, and action words. For each sentence pair, the node features are then discriminatively learned by incorporating neighbors of its own graph and information of its dual graph, reducing the error of synonyms or reference relationships in measuring the correlation between sentences, as well as the error caused by considering each component separately. In doing so, MGCNN utilizes subject agreement, attribute coherence, and action succession to measure text coherence. Besides, with the help of large language models, we augment the original text coherence annotations, improving the ability of MGCNN to judge coherence. 
Extensive experiments on three challenging datasets demonstrate the superiority of our approach and each proposed module, especially improving the latest records by +3.8%, +14.2% and +12% w.r.t. F1 scores, $\tau$ and $\rho$ metrics on the BLiSS dataset.
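The abstract reports gains in $\tau$ and $\rho$ without defining them. In video summarization evaluation these symbols commonly denote Kendall's and Spearman's rank correlation between predicted and human-annotated importance scores; that reading is an assumption here, not something the abstract states. The sketch below shows how the two coefficients could be computed for one video under that assumption. The function names and the toy scores are hypothetical and are not taken from the paper.

```python
from itertools import combinations


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(pred, truth):
    """Spearman's rho: Pearson correlation of the two rank vectors."""
    return pearson(average_ranks(pred), average_ranks(truth))


def kendall_tau(pred, truth):
    """Kendall's tau-a: (concordant - discordant) / total pairs; ties count as neither."""
    concordant = discordant = 0
    for i, j in combinations(range(len(pred)), 2):
        sign = (pred[i] - pred[j]) * (truth[i] - truth[j])
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
    total_pairs = len(pred) * (len(pred) - 1) // 2
    return (concordant - discordant) / total_pairs


# Hypothetical per-segment importance scores for a single video.
predicted = [0.9, 0.1, 0.4, 0.7]
annotated = [0.8, 0.2, 0.6, 0.5]
tau = kendall_tau(predicted, annotated)
rho = spearman_rho(predicted, annotated)
```

Both coefficients compare orderings rather than raw values, so they reward a model for ranking segments the way annotators do even when the score scales differ. Published evaluations may use a tie-corrected variant of Kendall's coefficient, which can differ from the tau-a form used here when scores contain ties.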
