{"title":"VST3D-Net:基于视频的三维形状重构时空网络","authors":"Jinglun Yang, Guanglun Zhang, Youhua Li, Lu Yang","doi":"10.1109/IC3D51119.2020.9376350","DOIUrl":null,"url":null,"abstract":"In this paper, we propose the Video-based Spatio-Temporal 3D Network (VST3D-Net), which is a novel learning approach of viewpoint-invariant 3D shape reconstruction from monocular video. In our VST3D-Net, a spatial feature extraction subnetwork is designed to encode the local and global spatial relationships of the object in the image. The extracted latent spatial features have implicitly embedded both shape and pose information. Although a single view can also be used to recover a 3D shape, more rich shape information of the dynamic object can be explored and leveraged from video frames. To generate the viewpoint-free 3D shape, we design a temporal correlation feature extractor. It handles the temporal consistency of the shape and pose of the moving object simultaneously. Therefore, both the canonical 3D shape and the corresponding pose at different frame are recovered by the network. We validate our approach on the ShapeNet-based video dataset and ApolloCar3D dataset. The experimental results show the proposed VST3D-Net can outperform the state-of-the-art approaches both in accuracy and efficiency.","PeriodicalId":159318,"journal":{"name":"2020 International Conference on 3D Immersion (IC3D)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"VST3D-Net:Video-Based Spatio-Temporal Network for 3D Shape Reconstruction from a Video\",\"authors\":\"Jinglun Yang, Guanglun Zhang, Youhua Li, Lu Yang\",\"doi\":\"10.1109/IC3D51119.2020.9376350\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we propose the Video-based Spatio-Temporal 3D Network (VST3D-Net), which is a novel learning approach of viewpoint-invariant 3D shape reconstruction from monocular video. In our VST3D-Net, a spatial feature extraction subnetwork is designed to encode the local and global spatial relationships of the object in the image. The extracted latent spatial features have implicitly embedded both shape and pose information. Although a single view can also be used to recover a 3D shape, more rich shape information of the dynamic object can be explored and leveraged from video frames. To generate the viewpoint-free 3D shape, we design a temporal correlation feature extractor. It handles the temporal consistency of the shape and pose of the moving object simultaneously. Therefore, both the canonical 3D shape and the corresponding pose at different frame are recovered by the network. We validate our approach on the ShapeNet-based video dataset and ApolloCar3D dataset. The experimental results show the proposed VST3D-Net can outperform the state-of-the-art approaches both in accuracy and efficiency.\",\"PeriodicalId\":159318,\"journal\":{\"name\":\"2020 International Conference on 3D Immersion (IC3D)\",\"volume\":\"27 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-12-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 International Conference on 3D Immersion (IC3D)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IC3D51119.2020.9376350\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 International Conference on 3D Immersion (IC3D)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IC3D51119.2020.9376350","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
VST3D-Net:Video-Based Spatio-Temporal Network for 3D Shape Reconstruction from a Video
In this paper, we propose the Video-based Spatio-Temporal 3D Network (VST3D-Net), which is a novel learning approach of viewpoint-invariant 3D shape reconstruction from monocular video. In our VST3D-Net, a spatial feature extraction subnetwork is designed to encode the local and global spatial relationships of the object in the image. The extracted latent spatial features have implicitly embedded both shape and pose information. Although a single view can also be used to recover a 3D shape, more rich shape information of the dynamic object can be explored and leveraged from video frames. To generate the viewpoint-free 3D shape, we design a temporal correlation feature extractor. It handles the temporal consistency of the shape and pose of the moving object simultaneously. Therefore, both the canonical 3D shape and the corresponding pose at different frame are recovered by the network. We validate our approach on the ShapeNet-based video dataset and ApolloCar3D dataset. The experimental results show the proposed VST3D-Net can outperform the state-of-the-art approaches both in accuracy and efficiency.