Video Forecasting with Forward-Backward-Net: Delving Deeper into Spatiotemporal Consistency

Proceedings of the 26th ACM International Conference on Multimedia. Pub Date: 2018-10-15. DOI: 10.1145/3240508.3240551

Yuke Li


Citations: 5

Abstract

Video Forecasting with Forward-Backward-Net: Delving Deeper into Spatiotemporal Consistency

Video forecasting is an emerging topic in the computer vision field, and it is a pivotal step toward unsupervised video understanding. However, the predictions generated from the state-of-the-art methods might be far from ideal quality, due to a lack of guidance from the labeled data of correct predictions (e.g., the annotated future pose of a person). Hence, building a network for better predicting future sequences in an unsupervised manner has to be further pursued. To this end, we put forth a novel Forward-Backward-Net (FB-Net) architecture, which delves deeper into spatiotemporal consistency. It first derives the forward consistency from the raw historical observations. In contrast to mainstream video forecasting approaches, FB-Net then investigates the backward consistency from the future to the past to reinforce the predictions. The final predicted results are inferred by jointly taking both the forward and backward consistencies into account. Moreover, we embed the motion dynamics and the visual content into a single framework via the FB-Net architecture, which significantly differs from learning each component throughout the videos separately. We evaluate our FB-Net on the large-scale KTH and UCF101 datasets. The experiments show that it can introduce considerable margin improvements with respect to most recent leading studies.
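The abstract describes the forward-backward idea only at a high level, so the following is a minimal sketch of the general pattern and not FB-Net itself. It assumes a toy constant-velocity extrapolator in place of the learned networks, frames represented as flat lists of floats, and a simple weighted sum of the two error terms. The names `extrapolate`, `rollout`, and `forward_backward_loss` are hypothetical and do not come from the paper.

```python
from typing import List, Tuple

Frame = List[float]  # toy "frame": a flat list of pixel intensities


def extrapolate(a: Frame, b: Frame) -> Frame:
    """Continue the change a -> b by one step (constant-velocity assumption)."""
    return [2.0 * y - x for x, y in zip(a, b)]


def rollout(first: Frame, second: Frame, steps: int) -> List[Frame]:
    """Extrapolate repeatedly from two seed frames, returning `steps` new frames."""
    frames: List[Frame] = []
    prev, cur = first, second
    for _ in range(steps):
        nxt = extrapolate(prev, cur)
        frames.append(nxt)
        prev, cur = cur, nxt
    return frames


def mse(xs: List[Frame], ys: List[Frame]) -> float:
    """Mean squared error over all values of two equally shaped frame lists."""
    total, count = 0.0, 0
    for x, y in zip(xs, ys):
        for p, q in zip(x, y):
            total += (p - q) ** 2
            count += 1
    return total / count


def forward_backward_loss(
    past: List[Frame], future: List[Frame], weight: float = 1.0
) -> Tuple[float, float, float]:
    """Return (forward error, backward error, weighted total).

    Requires at least two past frames and at least two future frames.
    """
    # Forward pass: predict the future from the last two observed frames.
    predicted = rollout(past[-2], past[-1], len(future))
    forward = mse(predicted, future)

    # Backward pass: run the same predictor from the predicted future toward
    # the past, then compare the reconstruction with what was observed.
    reconstructed = rollout(predicted[1], predicted[0], len(past))[::-1]
    backward = mse(reconstructed, past)

    return forward, backward, forward + weight * backward


if __name__ == "__main__":
    past = [[0.0], [1.0], [4.0]]
    future = [[9.0], [16.0]]
    print(forward_backward_loss(past, future))
```

On a sequence that really does move at constant velocity both terms are zero. On the accelerating sequence in the usage example the backward term is nonzero because the reconstruction of the earliest frame drifts away from the observation, which is the kind of signal a backward consistency check is meant to expose.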
