{"title":"基于多层次特征融合的场景无关相机定位和类别级目标姿态估计","authors":"Wang Junyi, Yue Qi","doi":"10.1109/VR55154.2023.00041","DOIUrl":null,"url":null,"abstract":"In AR/MR applications, camera localization and object pose estimation both play crucial roles. The universality of learning techniques, often referred to as scene-independent localization and category-level pose estimation, presents challenges for both tasks. The two missions maintain close relationships due to the spatial geometry constraint, but differing task requirements result in distinct feature extraction. In this paper, we focus on simultaneous scene-independent camera localization and category-level object pose estimation with a unified learning framework. The system consists of a localization branch called SLO-LocNet, a pose estimation branch called SLO-ObjNet, a feature fusion module for feature sharing between two tasks, and two decoders for creating coordinate maps. In SLO-LocNet, localization features are produced for anticipating the relative pose between two adjusted frames using inputs of color and depth images. Furthermore, we establish an image fusion module in order to promote feature sharing in depth and color branches. With SLO-ObjNet, we take the detected depth image and its corresponding point cloud as inputs, and produce object pose features for pose estimation. A geometry fusion module is created to combine depth and point cloud information simultaneously. Between the two tasks, the image fusion module is also exploited to accomplish feature sharing. In terms of the loss function, we present a mixed optimization function that is composed of the relative camera pose, geometry constraint, absolute and relative object pose terms. To verify how well our algorithm could perform, we conduct experiments on both localization and pose estimation datasets, covering 7 Scenes, ScanNet, REAL275 and YCB-Video. All experiments demonstrate superior performance to other existing methods. We specifically train the network on ScanNet and test it on 7 Scenes to demonstrate the universality performance. Additionally, the positive effects of fusion modules and loss function are also demonstrated.","PeriodicalId":346767,"journal":{"name":"2023 IEEE Conference Virtual Reality and 3D User Interfaces (VR)","volume":" 32","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Simultaneous Scene-independent Camera Localization and Category-level Object Pose Estimation via Multi-level Feature Fusion\",\"authors\":\"Wang Junyi, Yue Qi\",\"doi\":\"10.1109/VR55154.2023.00041\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In AR/MR applications, camera localization and object pose estimation both play crucial roles. The universality of learning techniques, often referred to as scene-independent localization and category-level pose estimation, presents challenges for both tasks. The two missions maintain close relationships due to the spatial geometry constraint, but differing task requirements result in distinct feature extraction. In this paper, we focus on simultaneous scene-independent camera localization and category-level object pose estimation with a unified learning framework. The system consists of a localization branch called SLO-LocNet, a pose estimation branch called SLO-ObjNet, a feature fusion module for feature sharing between two tasks, and two decoders for creating coordinate maps. 
In SLO-LocNet, localization features are produced for anticipating the relative pose between two adjusted frames using inputs of color and depth images. Furthermore, we establish an image fusion module in order to promote feature sharing in depth and color branches. With SLO-ObjNet, we take the detected depth image and its corresponding point cloud as inputs, and produce object pose features for pose estimation. A geometry fusion module is created to combine depth and point cloud information simultaneously. Between the two tasks, the image fusion module is also exploited to accomplish feature sharing. In terms of the loss function, we present a mixed optimization function that is composed of the relative camera pose, geometry constraint, absolute and relative object pose terms. To verify how well our algorithm could perform, we conduct experiments on both localization and pose estimation datasets, covering 7 Scenes, ScanNet, REAL275 and YCB-Video. All experiments demonstrate superior performance to other existing methods. We specifically train the network on ScanNet and test it on 7 Scenes to demonstrate the universality performance. Additionally, the positive effects of fusion modules and loss function are also demonstrated.\",\"PeriodicalId\":346767,\"journal\":{\"name\":\"2023 IEEE Conference Virtual Reality and 3D User Interfaces (VR)\",\"volume\":\" 32\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2023 IEEE Conference Virtual Reality and 3D User Interfaces (VR)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/VR55154.2023.00041\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2023 IEEE Conference Virtual Reality and 3D User Interfaces (VR)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/VR55154.2023.00041","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
In AR/MR applications, camera localization and object pose estimation both play crucial roles. Generalizing learned models across scenes and object instances, referred to as scene-independent localization and category-level pose estimation, remains challenging for both tasks. The two tasks are closely related through spatial geometry constraints, but their differing requirements lead to distinct feature extraction. In this paper, we focus on simultaneous scene-independent camera localization and category-level object pose estimation within a unified learning framework. The system consists of a localization branch called SLO-LocNet, a pose estimation branch called SLO-ObjNet, a feature fusion module for sharing features between the two tasks, and two decoders for creating coordinate maps. In SLO-LocNet, localization features are produced to predict the relative pose between two adjacent frames from color and depth image inputs. We further introduce an image fusion module to promote feature sharing between the depth and color branches. In SLO-ObjNet, we take the detected depth image and its corresponding point cloud as inputs and produce object pose features for pose estimation. A geometry fusion module is created to combine depth and point cloud information simultaneously. The image fusion module is also exploited to share features between the two tasks. For the loss function, we present a mixed optimization objective composed of relative camera pose, geometry constraint, and absolute and relative object pose terms. To evaluate our algorithm, we conduct experiments on both localization and pose estimation datasets, covering 7 Scenes, ScanNet, REAL275, and YCB-Video. All experiments demonstrate superior performance over existing methods. To demonstrate generalization across scenes, we train the network on ScanNet and test it on 7 Scenes. Additionally, the positive effects of the fusion modules and the loss function are demonstrated.
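To make the described pipeline more concrete, the sketch below outlines one possible reading of the two-branch design in PyTorch-style Python. It is a minimal illustration under stated assumptions, not the authors' implementation: the class names other than SLO-LocNet/SLO-ObjNet, the layer sizes, the concatenation-plus-projection fusion, the 7-D pose parameterization (quaternion plus translation), and the loss weights are all placeholders introduced here.

import torch
import torch.nn as nn


class ImageFusion(nn.Module):
    # Shares features between the color and depth branches; a plain
    # concatenation followed by a 1x1 convolution is assumed here.
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, color_feat, depth_feat):
        return self.mix(torch.cat([color_feat, depth_feat], dim=1))


class GeometryFusion(nn.Module):
    # Combines per-point depth and point-cloud features (concat + linear assumed).
    def __init__(self, channels):
        super().__init__()
        self.mix = nn.Linear(2 * channels, channels)

    def forward(self, depth_feat, point_feat):
        return self.mix(torch.cat([depth_feat, point_feat], dim=-1))


class LocBranchSketch(nn.Module):
    # Stand-in for SLO-LocNet: color + depth of two adjacent frames ->
    # fused features and a 7-D relative camera pose (quaternion + translation).
    def __init__(self, channels=64):
        super().__init__()
        self.color_enc = nn.Conv2d(3, channels, kernel_size=3, stride=2, padding=1)
        self.depth_enc = nn.Conv2d(1, channels, kernel_size=3, stride=2, padding=1)
        self.fuse = ImageFusion(channels)
        self.pose_head = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(2 * channels, 7))

    def forward(self, colors, depths):
        # colors/depths: a pair of tensors, one per adjacent frame.
        feats = [self.fuse(self.color_enc(c), self.depth_enc(d))
                 for c, d in zip(colors, depths)]
        rel_cam_pose = self.pose_head(torch.cat(feats, dim=1))
        return feats, rel_cam_pose


class ObjBranchSketch(nn.Module):
    # Stand-in for SLO-ObjNet: detected depth values + corresponding point cloud
    # -> a 7-D category-level object pose.
    def __init__(self, channels=64):
        super().__init__()
        self.depth_enc = nn.Linear(1, channels)   # per-point depth embedding (assumption)
        self.point_enc = nn.Linear(3, channels)   # per-point xyz embedding (assumption)
        self.fuse = GeometryFusion(channels)
        self.pose_head = nn.Linear(channels, 7)

    def forward(self, depth_values, points):
        # depth_values: (B, N, 1), points: (B, N, 3)
        fused = self.fuse(self.depth_enc(depth_values), self.point_enc(points))
        return self.pose_head(fused.mean(dim=1))


def mixed_loss(rel_cam, geo, abs_obj, rel_obj, w=(1.0, 1.0, 1.0, 1.0)):
    # Weighted sum of the four terms named in the abstract; the weights are placeholders.
    return w[0] * rel_cam + w[1] * geo + w[2] * abs_obj + w[3] * rel_obj

In this reading, the fused localization features produced by the image fusion module would also be the ones shared with the object branch, and the four loss terms named in the abstract would be combined as a simple weighted sum; how the paper actually couples the branches and weights the terms is not specified in the abstract.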