Multi-head attention with reinforcement learning for supervised video summarization
Bhakti Deepak Kadam, Ashwini Mangesh Deshpande
Journal of Electronic Imaging, published 2024-09-01
DOI: https://doi.org/10.1117/1.jei.33.5.053010
Impact factor: 1.0; JCR: Q4 (Engineering, Electrical & Electronic)
Citation count: 0
Abstract
The rapid growth of video data available on the internet has made video summarization a persistent focus of the computer vision research community. Many recent summarization techniques rely on bidirectional long short-term memory for its proficiency in modeling temporal dependencies. However, its effectiveness is limited to short video clips, typically no longer than 90 to 100 frames. To address this constraint, the proposed approach incorporates global and local multi-head attention, capturing temporal dependencies at both levels. Attention also enables parallel computation, which improves overall performance on longer videos. This work treats video summarization as a supervised learning task and introduces a deep summarization architecture called multi-head attention with reinforcement learning (MHA-RL). The architecture comprises a pretrained convolutional neural network that extracts features from video frames, along with global and local multi-head attention mechanisms that predict frame importance scores. Additionally, the network integrates a reinforcement learning (RL)-based regressor network that accounts for the diversity and representativeness of the generated video summary. Extensive experiments are conducted on benchmark datasets such as TVSum and SumMe. Both qualitative and quantitative results indicate that the proposed method outperforms the majority of state-of-the-art summarization techniques.
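The difference between global and local attention described in the abstract can be illustrated with a small sketch. The abstract does not say how the local neighborhood is defined, so the `window` parameter below is an assumption, as are the function names and the toy features. Learned query, key, value, and output projections are omitted (treated as identity) to keep the example short, so this is not the MHA-RL network itself, only the attention pattern it builds on.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values, window=None):
    """Scaled dot-product attention over a sequence of frame vectors.

    window=None: every frame attends to all frames (global).
    window=w:    frame t attends only to frames s with |t - s| <= w (local).
    The window definition is an assumption made for illustration.
    """
    d = len(queries[0])
    out = []
    for t, q in enumerate(queries):
        idx = [s for s in range(len(keys))
               if window is None or abs(t - s) <= window]
        scores = [sum(a * b for a, b in zip(q, keys[s])) / math.sqrt(d)
                  for s in idx]
        weights = softmax(scores)
        out.append([sum(w * values[s][j] for w, s in zip(weights, idx))
                    for j in range(len(values[0]))])
    return out


def multi_head(features, num_heads, window=None):
    """Split each feature vector into num_heads chunks, run attention on
    each chunk independently, and concatenate the per-head outputs.
    Learned projections are omitted (identity) in this sketch."""
    dim = len(features[0])
    if dim % num_heads != 0:
        raise ValueError("feature dimension must be divisible by num_heads")
    head_dim = dim // num_heads
    heads = []
    for h in range(num_heads):
        chunk = [f[h * head_dim:(h + 1) * head_dim] for f in features]
        heads.append(attention(chunk, chunk, chunk, window))
    return [[x for head in heads for x in head[t]]
            for t in range(len(features))]


# Toy stand-in for CNN frame features: 6 frames, 4 dimensions each.
features = [[float((t * 7 + j * 3) % 5) for j in range(4)] for t in range(6)]

global_ctx = multi_head(features, num_heads=2)           # all frames visible
local_ctx = multi_head(features, num_heads=2, window=1)  # neighbors only
```

Because each head processes every frame independently of the others, the per-frame loop could be computed in parallel, which is the property the abstract contrasts with recurrent models. In a full model, the global and local context vectors would be passed to a regressor that outputs one importance score per frame.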
About the journal:
The Journal of Electronic Imaging publishes peer-reviewed papers in all technology areas that make up the field of electronic imaging and are normally considered in the design, engineering, and applications of electronic imaging systems.