Daniel Rotman, Dror Porat, G. Ashour, Udi Barzelay
Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval. Published 2018-06-05. DOI: 10.1145/3206025.3206055 (https://doi.org/10.1145/3206025.3206055)
Optimally Grouped Deep Features Using Normalized Cost for Video Scene Detection
Video scene detection is the task of temporally dividing a video into its semantic sections. This is an important preliminary step for effective analysis of heterogeneous video content. We present a unique formulation of this task as a generic optimization problem with a novel normalized cost function, aimed at optimal grouping of consecutive shots into scenes. The mathematical properties of the proposed normalized cost function enable robust scene detection, even in challenging real-world scenarios. We present a novel dynamic programming formulation for efficiently optimizing the proposed cost function despite an inherent dependency between subproblems. We use deep neural network models for visual and audio analysis to encode the semantic elements in the video scene, enabling more effective and more accurate video scene detection. The proposed method has two key advantages compared to other approaches: it inherently provides a temporally consistent division of the video into scenes, and it is parameter-free, eliminating the need for fine-tuning for different types of content. While our method can adaptively estimate the number of scenes from the video content, we also present a new non-greedy procedure for creating a hierarchical consensus-based division tree spanning multiple levels of granularity. We provide comprehensive experimental results showing the benefits of the normalized cost function and demonstrating that the proposed method outperforms the current state of the art in video scene detection.
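The idea of grouping consecutive shots into scenes by dynamic programming can be illustrated with a minimal sketch. It is not the method of the paper: the abstract does not state the normalized cost function, so the sketch assumes a plain additive cost (the sum of pairwise distances between shots inside each scene) and a number of scenes fixed in advance. With an additive cost the subproblems are independent, so the sketch does not address the dependency between subproblems that the abstract mentions. The names `block_costs` and `group_shots` and the toy one-dimensional features are hypothetical, chosen only for illustration.

```python
from typing import List, Sequence


def block_costs(dist: Sequence[Sequence[float]]) -> List[List[float]]:
    """cost[i][j] = sum of pairwise distances among shots i..j (inclusive)."""
    n = len(dist)
    cost = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Extending the block i..j-1 by shot j adds the distances
            # from shot j to every shot already in the block.
            cost[i][j] = cost[i][j - 1] + sum(dist[k][j] for k in range(i, j))
    return cost


def group_shots(dist: Sequence[Sequence[float]], num_scenes: int) -> List[int]:
    """Split shots 0..n-1 into contiguous scenes; return each scene's first shot.

    Assumption: additive (un-normalized) cost and a fixed number of scenes.
    """
    n = len(dist)
    if not 1 <= num_scenes <= n:
        raise ValueError("num_scenes must be between 1 and the number of shots")
    cost = block_costs(dist)
    inf = float("inf")
    # best[k][j]: minimal cost of splitting the first j shots into k scenes.
    best = [[inf] * (n + 1) for _ in range(num_scenes + 1)]
    # back[k][j]: first shot of the last scene in that optimal split.
    back = [[0] * (n + 1) for _ in range(num_scenes + 1)]
    best[0][0] = 0.0
    for k in range(1, num_scenes + 1):
        for j in range(k, n + 1):
            for i in range(k - 1, j):  # last scene covers shots i..j-1
                candidate = best[k - 1][i] + cost[i][j - 1]
                if candidate < best[k][j]:
                    best[k][j] = candidate
                    back[k][j] = i
    # Walk the back-pointers from the end of the video to the start.
    starts: List[int] = []
    j = n
    for k in range(num_scenes, 0, -1):
        j = back[k][j]
        starts.append(j)
    return starts[::-1]


# Toy example: one scalar "feature" per shot, distance = absolute difference.
features = [0.0, 0.1, 0.2, 5.0, 5.1, 9.0]
dist = [[abs(a - b) for b in features] for a in features]
scene_starts = group_shots(dist, 3)
print(scene_starts)
```

Because every scene is a contiguous run of shots by construction, the resulting division is temporally consistent. In this toy input the three clusters of feature values are well separated, so the scene boundaries fall between them. A real system would replace the scalar features with embeddings from visual and audio networks and would need a rule for choosing the number of scenes, which this sketch leaves out.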