Ryosuke Shimoya, Takashi Morimoto, J. van Baar, P. Boufounos, Yanting Ma, Hassan Mansour
2022 18th IEEE International Conference on Advanced Video and Signal Based Surveillance (AVSS), published 2022-11-29
DOI: 10.1109/AVSS56176.2022.9959354
Learning Occlusion-Aware Dense Correspondences for Multi-Modal Images
We introduce a scalable multi-modal approach to learn dense (i.e., pixel-level) correspondences and occlusion maps between images in a video sequence. The problems of finding dense correspondences and occlusion maps are fundamental in computer vision. In this work we jointly train a deep network to tackle both, with a shared feature extraction stage. We use depth and color images with ground-truth optical flow and occlusion maps to train the network end-to-end. From the multi-modal input, the network learns to estimate occlusion maps, optical flow, and a correspondence embedding that provides a meaningful latent feature space. We evaluate the performance on a dataset of images derived from synthetic characters, and perform a thorough ablation study to demonstrate that the proposed components of our architecture combine to achieve the lowest correspondence error. The scalability of our proposed method comes from the ability to incorporate additional modalities, e.g., infrared images.
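The abstract describes a shared feature extractor over multi-modal input feeding separate heads for optical flow, occlusion, and a correspondence embedding. The following is a minimal NumPy sketch of that data flow only; all layer shapes, names, and the 1x1 linear "layers" are illustrative placeholders, not the authors' actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Tiny multi-modal input for illustration: color (3 ch) + depth (1 ch).
H, W = 8, 8
color = rng.standard_normal((3, H, W))
depth = rng.standard_normal((1, H, W))
x = np.concatenate([color, depth], axis=0)        # (4, H, W)

# Shared "feature extraction" stage: here just a 1x1 channel mix
# (a placeholder for a real convolutional backbone).
W_shared = rng.standard_normal((16, 4))
feats = np.einsum('oc,chw->ohw', W_shared, x)     # (16, H, W) shared features

# Separate task heads operating on the shared features.
W_flow = rng.standard_normal((2, 16))   # 2-channel optical flow (u, v)
W_occ = rng.standard_normal((1, 16))    # 1-channel occlusion logits
W_emb = rng.standard_normal((8, 16))    # 8-dim correspondence embedding

flow = np.einsum('oc,chw->ohw', W_flow, feats)
occlusion = 1.0 / (1.0 + np.exp(-np.einsum('oc,chw->ohw', W_occ, feats)))
embedding = np.einsum('oc,chw->ohw', W_emb, feats)

print(flow.shape, occlusion.shape, embedding.shape)
```

Additional modalities (e.g., infrared) would extend the input channel dimension, which is the scalability property the abstract highlights; everything downstream of the shared stage is unchanged.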