{"title":"基于前向-后向网络的视频预测:更深入的时空一致性研究","authors":"Yuke Li","doi":"10.1145/3240508.3240551","DOIUrl":null,"url":null,"abstract":"Video forecasting is an emerging topic in the computer vision field, and it is a pivotal step toward unsupervised video understanding. However, the predictions generated from the state-of-the-art methods might be far from ideal quality, due to a lack of guidance from the labeled data of correct predictions (e.g., the annotated future pose of a person). Hence, building a network for better predicting future sequences in an unsupervised manner has to be further pursued. To this end, we put forth a novel Forward-Backward-Net (FB-Net) architecture, which delves deeper into spatiotemporal consistency. It first derives the forward consistency from the raw historical observations. In contrast to mainstream video forecasting approaches, FB-Net then investigates the backward consistency from the future to the past to reinforce the predictions. The final predicted results are inferred by jointly taking both the forward and backward consistencies into account. Moreover, we embed the motion dynamics and the visual content into a single framework via the FB-Net architecture, which significantly differs from learning each component throughout the videos separately. We evaluate our FB-Net on the large-scale KTH and UCF101 datasets. The experiments show that it can introduce considerable margin improvements with respect to most recent leading studies.","PeriodicalId":339857,"journal":{"name":"Proceedings of the 26th ACM international conference on Multimedia","volume":"30 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-10-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Video Forecasting with Forward-Backward-Net: Delving Deeper into Spatiotemporal Consistency\",\"authors\":\"Yuke Li\",\"doi\":\"10.1145/3240508.3240551\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Video forecasting is an emerging topic in the computer vision field, and it is a pivotal step toward unsupervised video understanding. However, the predictions generated from the state-of-the-art methods might be far from ideal quality, due to a lack of guidance from the labeled data of correct predictions (e.g., the annotated future pose of a person). Hence, building a network for better predicting future sequences in an unsupervised manner has to be further pursued. To this end, we put forth a novel Forward-Backward-Net (FB-Net) architecture, which delves deeper into spatiotemporal consistency. It first derives the forward consistency from the raw historical observations. In contrast to mainstream video forecasting approaches, FB-Net then investigates the backward consistency from the future to the past to reinforce the predictions. The final predicted results are inferred by jointly taking both the forward and backward consistencies into account. Moreover, we embed the motion dynamics and the visual content into a single framework via the FB-Net architecture, which significantly differs from learning each component throughout the videos separately. We evaluate our FB-Net on the large-scale KTH and UCF101 datasets. 
The experiments show that it can introduce considerable margin improvements with respect to most recent leading studies.\",\"PeriodicalId\":339857,\"journal\":{\"name\":\"Proceedings of the 26th ACM international conference on Multimedia\",\"volume\":\"30 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-10-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 26th ACM international conference on Multimedia\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3240508.3240551\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 26th ACM international conference on Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3240508.3240551","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Video Forecasting with Forward-Backward-Net: Delving Deeper into Spatiotemporal Consistency
Video forecasting is an emerging topic in computer vision and a pivotal step toward unsupervised video understanding. However, predictions from state-of-the-art methods can fall far short of ideal quality, because no labeled data of correct future outcomes (e.g., the annotated future pose of a person) is available to guide learning. Building a network that better predicts future sequences in an unsupervised manner therefore remains an open problem. To this end, we propose a novel Forward-Backward-Net (FB-Net) architecture that delves deeper into spatiotemporal consistency. FB-Net first derives forward consistency from the raw historical observations. Then, in contrast to mainstream video forecasting approaches, it enforces backward consistency from the future to the past to reinforce the predictions. The final predictions are inferred by jointly taking both the forward and backward consistencies into account. Moreover, FB-Net embeds motion dynamics and visual content in a single framework, which differs significantly from learning each component separately across the videos. We evaluate FB-Net on the large-scale KTH and UCF101 datasets; the experiments show considerable improvements over recent leading studies.
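Since this record carries only the abstract, the exact networks and losses of FB-Net are not specified here. As a purely illustrative sketch of the forward-backward consistency idea the abstract describes, the following PyTorch snippet pairs a forward predictor (past to future) with a backward predictor (future to past) and penalizes inconsistency in both directions; the toy FramePredictor network, the L1 losses, and the weight lam are all assumptions for illustration, not the architecture of the paper.

```python
# Hypothetical sketch of a forward-backward consistency objective for
# video prediction. Module shapes, losses, and weights are illustrative
# assumptions, not the FB-Net described in the paper.
import torch
import torch.nn as nn

class FramePredictor(nn.Module):
    """Maps a stack of context frames to the next frame (toy CNN)."""
    def __init__(self, context_len: int, channels: int = 3):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(context_len * channels, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, 64, 3, padding=1),
            nn.ReLU(inplace=True),
            nn.Conv2d(64, channels, 3, padding=1),
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (batch, time, channels, H, W) -> fold time into channels
        b, t, c, h, w = frames.shape
        return self.net(frames.reshape(b, t * c, h, w))

def forward_backward_loss(past, future, fwd, bwd, lam=1.0):
    """Forward term: predict the first future frame from the past.
    Backward term: run a time-reversed predictor over the (predicted)
    future and require it to recover the last observed past frame."""
    pred_next = fwd(past)                                  # past -> future
    fwd_loss = nn.functional.l1_loss(pred_next, future[:, 0])

    # Splice the prediction into the future clip, then go future -> past.
    future_hat = torch.cat([pred_next.unsqueeze(1), future[:, 1:]], dim=1)
    pred_prev = bwd(torch.flip(future_hat, dims=[1]))      # reverse time
    bwd_loss = nn.functional.l1_loss(pred_prev, past[:, -1])
    return fwd_loss + lam * bwd_loss

# Toy usage on random tensors (a KTH-like setup would use channels=1).
past = torch.randn(2, 4, 3, 64, 64)     # 4 observed context frames
future = torch.randn(2, 4, 3, 64, 64)   # 4 future frames
fwd, bwd = FramePredictor(4), FramePredictor(4)
loss = forward_backward_loss(past, future, fwd, bwd)
loss.backward()
```

The second loss term is what makes the objective bidirectional: the predicted future is fed back through a time-reversed model and must reconstruct the observed past, so a prediction that breaks temporal consistency in either direction is penalized rather than only being scored against the next ground-truth frame.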