{"title":"融合手工制作和深度视听特征的用户生成视频综述","authors":"Theodoros Psallidas, E. Spyrou, S. Perantonis","doi":"10.1109/SMAP56125.2022.9941864","DOIUrl":null,"url":null,"abstract":"The ever-increasing amount of user-generated audiovisual content has increased the demand for easy navigation across content collections and repositories, necessitating detailed, yet concise content representations. A typical method to this goal is to construct a visual summary, which is significantly more expressive than other alternatives, such as verbal annotations. In this paper, we describe a video summarization technique which is based on the extraction and the fusion of audio and visual data, in order to generate dynamic video summaries, i.e., video summaries that include the most essential video segments from the original video, while maintaining their original temporal sequence. Based on the extracted features, each video segment is classified as being “interesting” or “uninteresting,” and hence included or excluded from the final summary. The originality of our technique is that prior to classification, we employ a transfer learning strategy to extract deep features from pre-trained models as input to the classifiers, making them more intuitive and robust to objectiveness. We evaluate our technique on a large dataset of user-generated videos and demonstrate that the addition of deep features is able to improve classification performance, resulting in more concrete video summaries, compared to the use of only hand-crafted features.","PeriodicalId":432172,"journal":{"name":"2022 17th International Workshop on Semantic and Social Media Adaptation & Personalization (SMAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Summarization of User-Generated Videos Fusing Handcrafted and Deep Audiovisual Features\",\"authors\":\"Theodoros Psallidas, E. Spyrou, S. Perantonis\",\"doi\":\"10.1109/SMAP56125.2022.9941864\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The ever-increasing amount of user-generated audiovisual content has increased the demand for easy navigation across content collections and repositories, necessitating detailed, yet concise content representations. A typical method to this goal is to construct a visual summary, which is significantly more expressive than other alternatives, such as verbal annotations. In this paper, we describe a video summarization technique which is based on the extraction and the fusion of audio and visual data, in order to generate dynamic video summaries, i.e., video summaries that include the most essential video segments from the original video, while maintaining their original temporal sequence. Based on the extracted features, each video segment is classified as being “interesting” or “uninteresting,” and hence included or excluded from the final summary. The originality of our technique is that prior to classification, we employ a transfer learning strategy to extract deep features from pre-trained models as input to the classifiers, making them more intuitive and robust to objectiveness. 
We evaluate our technique on a large dataset of user-generated videos and demonstrate that the addition of deep features is able to improve classification performance, resulting in more concrete video summaries, compared to the use of only hand-crafted features.\",\"PeriodicalId\":432172,\"journal\":{\"name\":\"2022 17th International Workshop on Semantic and Social Media Adaptation & Personalization (SMAP)\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-11-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 17th International Workshop on Semantic and Social Media Adaptation & Personalization (SMAP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SMAP56125.2022.9941864\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 17th International Workshop on Semantic and Social Media Adaptation & Personalization (SMAP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SMAP56125.2022.9941864","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Summarization of User-Generated Videos Fusing Handcrafted and Deep Audiovisual Features
The ever-increasing amount of user-generated audiovisual content has increased the demand for easy navigation across content collections and repositories, necessitating detailed yet concise content representations. A typical approach to this goal is to construct a visual summary, which is significantly more expressive than alternatives such as verbal annotations. In this paper, we describe a video summarization technique based on the extraction and fusion of audio and visual features, used to generate dynamic video summaries, i.e., summaries that include the most essential segments of the original video while preserving their original temporal order. Based on the extracted features, each video segment is classified as “interesting” or “uninteresting,” and accordingly included in or excluded from the final summary. The originality of our technique is that, prior to classification, we employ a transfer learning strategy to extract deep features from pre-trained models as input to the classifiers, making them more intuitive and robust. We evaluate our technique on a large dataset of user-generated videos and demonstrate that adding deep features improves classification performance, yielding more representative video summaries than hand-crafted features alone.
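
As a rough illustration of the pipeline the abstract outlines, the sketch below extracts deep features from a pre-trained CNN used as a fixed feature extractor (the transfer learning step), fuses them with a toy hand-crafted descriptor, and keeps the segments an "interesting vs. uninteresting" classifier accepts, in their original temporal order. The ResNet-18 backbone, the color-histogram descriptor, and the SVM classifier are illustrative assumptions; the abstract does not name the paper's actual models or features, and audio features are omitted for brevity.

```python
# Minimal sketch of a fuse-then-classify summarization pipeline.
# Backbone, hand-crafted descriptor, and classifier choices are assumptions,
# not the paper's confirmed configuration.
import numpy as np
import torch
import torchvision.models as models
import torchvision.transforms as T
from sklearn.svm import SVC

# Transfer learning: a pre-trained CNN with its classification head removed
# serves as a fixed deep-feature extractor.
backbone = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
backbone.fc = torch.nn.Identity()  # keep the 512-d penultimate embedding
backbone.eval()

preprocess = T.Compose([
    T.ToPILImage(),
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

def deep_features(frame_rgb: np.ndarray) -> np.ndarray:
    """512-d deep embedding of one video frame (H x W x 3, uint8)."""
    with torch.no_grad():
        x = preprocess(frame_rgb).unsqueeze(0)
        return backbone(x).squeeze(0).numpy()

def handcrafted_features(frame_rgb: np.ndarray) -> np.ndarray:
    """Toy hand-crafted descriptor: per-channel 16-bin color histogram
    (a placeholder for the paper's audiovisual descriptors)."""
    hists = [np.histogram(frame_rgb[..., c], bins=16, range=(0, 255))[0]
             for c in range(3)]
    h = np.concatenate(hists).astype(float)
    return h / (h.sum() + 1e-8)

def segment_feature(frames: list[np.ndarray]) -> np.ndarray:
    """Early fusion: average per-frame deep and hand-crafted features over a
    segment, then concatenate them into one vector."""
    deep = np.mean([deep_features(f) for f in frames], axis=0)
    hand = np.mean([handcrafted_features(f) for f in frames], axis=0)
    return np.concatenate([deep, hand])

def summarize(segments, segments_train, labels_train):
    """Train an 'interesting' (1) vs. 'uninteresting' (0) classifier, then
    keep the segments predicted interesting, in temporal order."""
    clf = SVC()
    clf.fit([segment_feature(s) for s in segments_train], labels_train)
    keep = clf.predict([segment_feature(s) for s in segments])
    return [s for s, k in zip(segments, keep) if k == 1]
```

Because the selected segments are filtered from the input list rather than re-ranked, the output summary preserves the source video's temporal sequence, matching the "dynamic summary" behavior the abstract describes.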