Video Summarization Based on Feature Fusion and Data Augmentation

IF 2.6 Q2 COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS

Computers Pub Date : 2023-09-15 DOI:10.3390/computers12090186

Theodoros Psallidas, Evaggelos Spyrou


Citations: 0

Abstract

In recent years, several technological advances have led to an increase in the creation and consumption of audiovisual multimedia content. Users are overexposed to videos via social media, video sharing websites, and mobile phone applications. For efficient browsing, searching, and navigation across multimedia collections and repositories, e.g., for finding videos that are relevant to a particular topic or interest, this ever-increasing content should be described by informative yet concise content representations. A common solution to this problem is the construction of a brief summary of a video, which can be presented to the user instead of the full video, so that they can decide whether to watch or ignore the whole video. Such summaries are ideally more expressive than alternatives such as brief textual descriptions or keywords. In this work, the video summarization problem is approached as a supervised classification task that relies on feature fusion of audio and visual data. Specifically, the goal is to generate dynamic video summaries, i.e., compositions of parts of the original video that include its most essential segments while preserving the original temporal sequence. This work relies on datasets annotated on a per-frame basis, wherein parts of videos are annotated as being “informative” or “noninformative”, with the latter being excluded from the produced summary. The novelties of the proposed approach are that (a) prior to classification, a transfer learning strategy is employed to extract deep features from pretrained models, and these features are used as input to the classifiers, making them more intuitive and robust to objectiveness, and (b) the training dataset is augmented with other publicly available datasets. The proposed approach is evaluated using three datasets of user-generated videos, and it is demonstrated that deep features and data augmentation improve the accuracy of video summaries with respect to human annotations. Moreover, the approach is domain independent, can be used on any video, and can be extended to rely on richer feature representations or to include other data modalities.
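The pipeline outlined in the abstract (fuse audio and visual features, classify each part of the video as informative or not, then keep the informative parts in their original order) can be illustrated with a minimal sketch. Everything here is an assumption for illustration only: the abstract does not specify the fusion scheme, the classifier, or the segment length, so the sketch uses plain concatenation as the fusion step, a hypothetical threshold rule (`toy_classifier`) in place of a trained model, and one-second segments.

```python
from typing import List, Optional, Sequence, Tuple


def fuse_features(audio: Sequence[float], visual: Sequence[float]) -> List[float]:
    """Early fusion (assumed): concatenate one segment's audio and visual vectors."""
    return list(audio) + list(visual)


def toy_classifier(features: Sequence[float], threshold: float = 0.5) -> int:
    """Hypothetical stand-in for a trained classifier: 1 = informative, 0 = noninformative."""
    return int(sum(features) / len(features) > threshold)


def build_summary(labels: Sequence[int], segment_seconds: float = 1.0) -> List[Tuple[float, float]]:
    """Merge consecutive informative segments into (start, end) intervals in seconds.

    Intervals are emitted in increasing time, so the summary preserves
    the original temporal order of the video.
    """
    intervals: List[Tuple[float, float]] = []
    start: Optional[int] = None
    for i, label in enumerate(labels):
        if label == 1 and start is None:
            start = i  # an informative run begins
        elif label != 1 and start is not None:
            intervals.append((start * segment_seconds, i * segment_seconds))
            start = None  # the run ended at the previous segment
    if start is not None:  # close a run that reaches the end of the video
        intervals.append((start * segment_seconds, len(labels) * segment_seconds))
    return intervals


# Made-up feature values for four one-second segments.
audio = [[0.1, 0.2], [0.9, 0.8], [0.7, 0.9], [0.2, 0.1]]
visual = [[0.0, 0.3, 0.1], [0.8, 0.9, 0.7], [0.9, 0.6, 0.8], [0.1, 0.2, 0.0]]

fused = [fuse_features(a, v) for a, v in zip(audio, visual)]
labels = [toy_classifier(x) for x in fused]
summary = build_summary(labels)
print(labels, summary)
```

In a real system the threshold rule would be replaced by a classifier trained on the annotated data, and the feature vectors would come from pretrained deep models rather than hand-written numbers.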

