基于归一化代价的深度特征最佳分组视频场景检测

Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval Pub Date : 2018-06-05 DOI:10.1145/3206025.3206055

Daniel Rotman, Dror Porat, G. Ashour, Udi Barzelay

{"title":"基于归一化代价的深度特征最佳分组视频场景检测","authors":"Daniel Rotman, Dror Porat, G. Ashour, Udi Barzelay","doi":"10.1145/3206025.3206055","DOIUrl":null,"url":null,"abstract":"Video scene detection is the task of temporally dividing a video into its semantic sections. This is an important preliminary step for effective analysis of heterogeneous video content. We present a unique formulation of this task as a generic optimization problem with a novel normalized cost function, aimed at optimal grouping of consecutive shots into scenes. The mathematical properties of the proposed normalized cost function enable robust scene detection, also in challenging real-world scenarios. We present a novel dynamic programming formulation for efficiently optimizing the proposed cost function despite an inherent dependency between subproblems. We use deep neural network models for visual and audio analysis to encode the semantic elements in the video scene, enabling effective and more accurate video scene detection. The proposed method has two key advantages compared to other approaches: it inherently provides a temporally consistent division of the video into scenes, and is also parameter-free, eliminating the need for fine-tuning for different types of content. While our method can adaptively estimate the number of scenes from the video content, we also present a new non-greedy procedure for creating a hierarchical consensus-based division tree spanning multiple levels of granularity. We provide comprehensive experimental results showing the benefits of the normalized cost function, and demonstrating that the proposed method outperforms the current state of the art in video scene detection.","PeriodicalId":224132,"journal":{"name":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","volume":"39 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-06-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"22","resultStr":"{\"title\":\"Optimally Grouped Deep Features Using Normalized Cost for Video Scene Detection\",\"authors\":\"Daniel Rotman, Dror Porat, G. Ashour, Udi Barzelay\",\"doi\":\"10.1145/3206025.3206055\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video scene detection is the task of temporally dividing a video into its semantic sections. This is an important preliminary step for effective analysis of heterogeneous video content. We present a unique formulation of this task as a generic optimization problem with a novel normalized cost function, aimed at optimal grouping of consecutive shots into scenes. The mathematical properties of the proposed normalized cost function enable robust scene detection, also in challenging real-world scenarios. We present a novel dynamic programming formulation for efficiently optimizing the proposed cost function despite an inherent dependency between subproblems. We use deep neural network models for visual and audio analysis to encode the semantic elements in the video scene, enabling effective and more accurate video scene detection. The proposed method has two key advantages compared to other approaches: it inherently provides a temporally consistent division of the video into scenes, and is also parameter-free, eliminating the need for fine-tuning for different types of content. While our method can adaptively estimate the number of scenes from the video content, we also present a new non-greedy procedure for creating a hierarchical consensus-based division tree spanning multiple levels of granularity. We provide comprehensive experimental results showing the benefits of the normalized cost function, and demonstrating that the proposed method outperforms the current state of the art in video scene detection.\",\"PeriodicalId\":224132,\"journal\":{\"name\":\"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval\",\"volume\":\"39 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-06-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"22\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3206025.3206055\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3206025.3206055","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 22

摘要

视频场景检测是将视频暂时划分为其语义部分的任务。这是对异构视频内容进行有效分析的重要前期步骤。我们提出了这个任务的一个独特的公式，作为一个具有新的归一化成本函数的通用优化问题，旨在将连续镜头最优地分组到场景中。所提出的归一化代价函数的数学特性使鲁棒的场景检测成为可能，也适用于具有挑战性的现实场景。尽管子问题之间存在固有的依赖关系，但我们提出了一种新的动态规划公式来有效地优化所提出的成本函数。我们使用深度神经网络模型进行视觉和音频分析，对视频场景中的语义元素进行编码，从而实现更有效、更准确的视频场景检测。与其他方法相比，所提出的方法有两个关键优势:它本质上提供了一个暂时一致的视频场景划分，并且也是无参数的，消除了对不同类型内容进行微调的需要。虽然我们的方法可以自适应地估计视频内容中的场景数量，但我们还提出了一种新的非贪婪过程，用于创建跨多个粒度级别的基于共识的分层划分树。我们提供了全面的实验结果，显示了归一化代价函数的优点，并证明了所提出的方法优于当前视频场景检测的最新技术。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Optimally Grouped Deep Features Using Normalized Cost for Video Scene Detection

Video scene detection is the task of temporally dividing a video into its semantic sections. This is an important preliminary step for effective analysis of heterogeneous video content. We present a unique formulation of this task as a generic optimization problem with a novel normalized cost function, aimed at optimal grouping of consecutive shots into scenes. The mathematical properties of the proposed normalized cost function enable robust scene detection, also in challenging real-world scenarios. We present a novel dynamic programming formulation for efficiently optimizing the proposed cost function despite an inherent dependency between subproblems. We use deep neural network models for visual and audio analysis to encode the semantic elements in the video scene, enabling effective and more accurate video scene detection. The proposed method has two key advantages compared to other approaches: it inherently provides a temporally consistent division of the video into scenes, and is also parameter-free, eliminating the need for fine-tuning for different types of content. While our method can adaptively estimate the number of scenes from the video content, we also present a new non-greedy procedure for creating a hierarchical consensus-based division tree spanning multiple levels of granularity. We provide comprehensive experimental results showing the benefits of the normalized cost function, and demonstrating that the proposed method outperforms the current state of the art in video scene detection.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 2018 ACM on International Conference on Multimedia Retrieval

自引率

0.00%

发文量