Multi-head attention with reinforcement learning for supervised video summarization

IF 1.0, Zone 4 (Computer Science), JCR Q4 (Engineering, Electrical & Electronic)

Journal of Electronic Imaging, Pub Date: 2024-09-01, DOI: 10.1117/1.jei.33.5.053010

Bhakti Deepak Kadam, Ashwini Mangesh Deshpande


Citations: 0

Abstract

With the rapid growth of available internet video data, video summarization has drawn sustained attention from the computer vision research community. Many recent summarization techniques rely on bidirectional long short-term memory (BiLSTM) for its ability to model temporal dependencies. However, its effectiveness is limited to short video clips, typically no more than 90 to 100 frames. To address this constraint, the proposed approach incorporates global and local multi-head attention, capturing temporal dependencies at both the global and local levels. Multi-head attention also permits parallel computation, which improves overall performance on longer videos. This work treats video summarization as a supervised learning task and introduces a deep summarization architecture called multi-head attention with reinforcement learning (MHA-RL). The architecture comprises a pretrained convolutional neural network that extracts features from video frames, followed by global and local multi-head attention mechanisms that predict frame importance scores. In addition, the network integrates a reinforcement learning (RL)-based regressor network that accounts for the diversity and representativeness of the generated summary. Extensive experiments are conducted on benchmark datasets such as TVSum and SumMe. Both qualitative and quantitative results indicate that the proposed method improves on the majority of state-of-the-art summarization techniques.
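
The difference between global and local attention described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract gives no layer sizes, window lengths, or projection details, so the function names (`attend`, `multi_head`), the `window` parameter, and the toy feature values are all hypothetical. Learned query/key/value projections, the CNN feature extractor, and the RL-based regressor are omitted; the only point shown is that global attention lets every frame look at every other frame, while local attention restricts each frame to a temporal neighbourhood.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(frames, window=None):
    """Scaled dot-product self-attention with Q = K = V = frames.

    window=None -> global: every frame attends to every frame.
    window=w    -> local: frame t attends only to frames s with |t - s| <= w.
    Learned projections are left out (an assumption made for brevity).
    """
    dim = len(frames[0])
    out = []
    for t, query in enumerate(frames):
        # Indices of the frames this query is allowed to look at.
        idx = [s for s in range(len(frames))
               if window is None or abs(t - s) <= window]
        scores = [sum(q * k for q, k in zip(query, frames[s])) / math.sqrt(dim)
                  for s in idx]
        weights = softmax(scores)
        # Output is the attention-weighted average of the visible frames.
        out.append([sum(w * frames[s][j] for w, s in zip(weights, idx))
                    for j in range(dim)])
    return out


def multi_head(frames, num_heads, window=None):
    """Split each feature vector into num_heads chunks, attend per chunk, concatenate."""
    dim = len(frames[0])
    assert dim % num_heads == 0, "feature size must be divisible by num_heads"
    step = dim // num_heads
    heads = [attend([f[h * step:(h + 1) * step] for f in frames], window)
             for h in range(num_heads)]
    return [[x for head in heads for x in head[t]] for t in range(len(frames))]


# Toy input: 6 frames with 4-dimensional features (hypothetical values).
frames = [[1.0, 0.0, 0.5, 0.2],
          [0.9, 0.1, 0.4, 0.3],
          [0.0, 1.0, 0.1, 0.8],
          [0.1, 0.9, 0.2, 0.7],
          [0.5, 0.5, 0.9, 0.1],
          [0.4, 0.6, 0.8, 0.0]]

global_out = multi_head(frames, num_heads=2)           # full temporal context
local_out = multi_head(frames, num_heads=2, window=1)  # +/- 1 frame neighbourhood
```

Each output row is computed independently of the others, which is the property that allows attention to be parallelized across frames, unlike the step-by-step recurrence of an LSTM. In a full model, the two outputs would typically be combined and passed to a regressor that produces per-frame importance scores; how MHA-RL combines them is not stated in the abstract.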


Source journal

Journal of Electronic Imaging (Engineering & Technology: Imaging Science & Photographic Technology)

CiteScore: 1.70

Self-citation rate: 27.30%

Articles published: 341

Review time: 4.0 months

About the journal: The Journal of Electronic Imaging publishes peer-reviewed papers in all technology areas that make up the field of electronic imaging and are normally considered in the design, engineering, and applications of electronic imaging systems.