{"title":"LiCROcc: Teach Radar for Accurate Semantic Occupancy Prediction Using LiDAR and Camera","authors":"Yukai Ma;Jianbiao Mei;Xuemeng Yang;Licheng Wen;Weihua Xu;Jiangning Zhang;Xingxing Zuo;Botian Shi;Yong Liu","doi":"10.1109/LRA.2024.3511427","DOIUrl":"https://doi.org/10.1109/LRA.2024.3511427","url":null,"abstract":"Semantic Scene Completion (SSC) is pivotal in autonomous driving perception, frequently confronted with the complexities of weather and illumination changes. The long-term strategy involves fusing multi-modal information to bolster the system's robustness. Radar, increasingly utilized for 3D target detection, is gradually replacing LiDAR in autonomous driving applications, offering a robust sensing alternative. In this letter, we focus on the potential of 3D radar in semantic scene completion, pioneering cross-modal refinement techniques for improved robustness against weather and illumination changes and enhancing SSC performance. Regarding model architecture, we propose a three-stage tight fusion approach on BEV to realize a fusion framework for point clouds and images. Based on this foundation, we designed three cross-modal distillation modules—CMRD, BRD, and PDD. Our approach enhances the performance in radar-only (R-LiCROcc) and radar-camera (RC-LiCROcc) settings by distilling to them the rich semantic and structural information of the fused features of LiDAR and camera. Finally, our LC-Fusion, R-LiCROcc and RC-LiCROcc achieve the best performance on the nuScenes-Occupancy dataset, with mIOU exceeding the baseline by 22.9%, 44.1%, and 15.5%, respectively.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 1","pages":"852-859"},"PeriodicalIF":4.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142825935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Loop Closure by Textual Cues in Challenging Environments","authors":"Tongxing Jin;Thien-Minh Nguyen;Xinhang Xu;Yizhuo Yang;Shenghai Yuan;Jianping Li;Lihua Xie","doi":"10.1109/LRA.2024.3511397","DOIUrl":"https://doi.org/10.1109/LRA.2024.3511397","url":null,"abstract":"Loop closure is an important task in robot navigation. However, existing methods mostly rely on some implicit or heuristic features of the environment, which can still fail to work in common environments such as corridors, tunnels, and warehouses. Indeed, navigating in such featureless, degenerative, and repetitive (FDR) environments would also pose a significant challenge even for humans, but explicit text cues in the surroundings often provide the best assistance. This inspires us to propose a multi-modal loop closure method based on explicit human-readable textual cues in FDR environments. Specifically, our approach first extracts scene text entities based on Optical Character Recognition (OCR), then creates a \u0000<italic>local</i>\u0000 map of text cues based on accurate LiDAR odometry and finally identifies loop closure events by a graph-theoretic scheme. Experiment results demonstrate that this approach has superior performance over existing methods that rely solely on visual and LiDAR sensors.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 1","pages":"812-819"},"PeriodicalIF":4.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142825851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"4D-CS: Exploiting Cluster Prior for 4D Spatio-Temporal LiDAR Semantic Segmentation","authors":"Jiexi Zhong;Zhiheng Li;Yubo Cui;Zheng Fang","doi":"10.1109/LRA.2024.3511411","DOIUrl":"https://doi.org/10.1109/LRA.2024.3511411","url":null,"abstract":"Semantic segmentation of LiDAR points has significant value for autonomous driving and mobile robot systems. Most approaches explore spatio-temporal information of multi-scan to identify the semantic classes and motion states for each point. However, these methods often overlook the segmentation consistency in space and time, which may result in point clouds within the same object being predicted as different categories. To handle this issue, our core idea is to generate cluster labels across multiple frames that can reflect the complete spatial structure and temporal information of objects. These labels serve as explicit guidance for our dual-branch network, 4D-CS, which integrates point-based and cluster-based branches to enable more consistent segmentation. Specifically, in the point-based branch, we leverage historical knowledge to enrich the current feature through temporal fusion on multiple views. In the cluster-based branch, we propose a new strategy to produce cluster labels of foreground objects and apply them to gather point-wise information to derive cluster features. We then merge neighboring clusters across multiple scans to restore missing features due to occlusion. Finally, in the point-cluster fusion stage, we adaptively fuse the information from the two branches to optimize segmentation results. Extensive experiments confirm the effectiveness of the proposed method, and we achieve state-of-the-art results on the multi-scan semantic and moving object segmentation on SemanticKITTI and nuScenes datasets.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 1","pages":"468-475"},"PeriodicalIF":4.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142798041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multimodal Target Localization With Landmark-Aware Positioning for Urban Mobility","authors":"Naoki Hosomi;Yui Iioka;Shumpei Hatanaka;Teruhisa Misu;Kentaro Yamada;Nanami Tsukamoto;Shunsuke Kobayashi;Komei Sugiura","doi":"10.1109/LRA.2024.3511404","DOIUrl":"https://doi.org/10.1109/LRA.2024.3511404","url":null,"abstract":"Advancements in vehicle automation technology are expected to significantly impact how humans interact with vehicles. In this study, we propose a method to create user-friendly control interfaces for autonomous vehicles in urban environments. The proposed model predicts the vehicle's destination on the images captured by the vehicle's cameras based on high-level navigation instructions. Our data analysis found that users often specify the destination based on the relative positions of landmarks in a scene. The task is challenging because users can specify arbitrary destinations on roads, which do not have distinct visual characteristics for prediction. Thus, the model should consider relationships between landmarks and the ideal stopping position. Existing approaches only model the relationships between instructions and destinations and do not explicitly model the relative positional relationships between landmarks and destinations. To address this limitation, the proposed Target Regressor in Positioning (TRiP) model includes a novel loss function, Landmark-aware Absolute-Relative Target Position Loss, and two novel modules, Target Position Localizer and Multi-Resolution Referring Expression Comprehension Feature Extractor. To validate TRiP, we built a new dataset by extending an existing dataset of referring expression comprehension. The model was evaluated on the dataset using a standard metric, and the results showed that TRiP significantly outperformed the baseline method.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 1","pages":"716-723"},"PeriodicalIF":4.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142821140","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DORec: Decomposed Object Reconstruction and Segmentation Utilizing 2D Self-Supervised Features","authors":"Jun Wu;Sicheng Li;Sihui Ji;Yifei Yang;Yue Wang;Rong Xiong;Yiyi Liao","doi":"10.1109/LRA.2024.3511425","DOIUrl":"https://doi.org/10.1109/LRA.2024.3511425","url":null,"abstract":"Recovering 3D geometry and textures of individual objects is crucial for many robotics applications, such as manipulation, pose estimation, and autonomous driving. However, decomposing a target object from a complex background is challenging. Most existing approaches rely on costly manual labels to acquire object instance perception. Recent advancements in 2D self-supervised learning offer new prospects for identifying objects of interest, yet leveraging such noisy 2D features for clean decomposition remains difficult. In this paper, we propose a Decomposed Object Reconstruction (DORec) network based on neural implicit representations. Our key idea is to use 2D self-supervised features to create two levels of masks for supervision: a binary mask for foreground regions and a K-cluster mask for semantically similar regions. These complementary masks result in robust decomposition. Experimental results on different datasets show DORec's superiority in segmenting and reconstructing diverse foreground objects from varied backgrounds enabling downstream tasks such as pose estimation.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 1","pages":"804-811"},"PeriodicalIF":4.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142825968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Agent Path Finding With Heterogeneous Geometric and Kinematic Constraints in Continuous Space","authors":"Wenbo Lin;Wei Song;Qiuguo Zhu;Shiqiang Zhu","doi":"10.1109/LRA.2024.3511435","DOIUrl":"https://doi.org/10.1109/LRA.2024.3511435","url":null,"abstract":"Multi-Agent Path Finding (MAPF) represents a pivotal area of research within multi-agent systems. Existing algorithms typically discretize the movement space of agents into grid or topology, neglecting agents' geometric characteristics and kinematic constraints. This limitation hampers their applicability and efficiency in practical industrial scenarios. In this paper, we propose a Priority-Based Search algorithm for heterogeneous mobile robots working in continuous space, addressing both geometric and kinematic constraints. This algorithm, named Continuous-space Heterogeneous Priority-Based Search (CHPBS), employs a two-level search structure and a priority tree for collision detection. To expedite single-agent path finding in continuous space, we introduce a Weighted Hybrid Safe Interval Path Planning algorithm (WHSIPP\u0000<inline-formula><tex-math>$_{d}$</tex-math></inline-formula>\u0000). Furthermore, we present three strategies to enhance our algorithm, collectively termed Enhanced-CHPBS (ECHPBS): Partial Expansion, Target Reasoning, and Adaptive Induced Priority. Comparative analysis against two baseline algorithms on a specialized benchmark demonstrates that ECHPBS achieves a success rate of 100% on a 100 m × 100 m map featuring 50 agents, with an average runtime of under 1 s, and maintains the same 100% success rate on a 300 m × 300 m map with 100 agents.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 1","pages":"492-499"},"PeriodicalIF":4.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142821220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EfficientNet-EA for Visual Location Recognition in Natural Scenes","authors":"Heng Zhang;Yanchao Chen;Yanli Liu","doi":"10.1109/LRA.2024.3511379","DOIUrl":"https://doi.org/10.1109/LRA.2024.3511379","url":null,"abstract":"In natural scenarios, the visual location recognition often experiences reduced accuracy because of variations in weather, lighting, camera angles, and occlusions caused by dynamic objects. This paper introduces an EfficientNet-EA-based algorithm specifically designed to tackle these challenges. The algorithm enhances its capabilities by appending the Efficient Feature Aggregation (EA) layer to the end of EfficientNet and by using MultiSimilarityLoss for training purposes. This design enhances the model's ability to extract features, thereby boosting efficiency and accuracy. During the training phase, the model adeptly identifies and utilizes hard-negative and challenging positive samples, which in turn enhances its training efficacy and generalizability across diverse situations. The experimental results indicate that EfficientNet-EA achieves a recall@10 of 98.6% on Pitts30k-test. The model demonstrates a certain degree of improvement in recognition rates under weather variations, changes in illumination, shifts in perspective, and the presence of dynamic object occlusions.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 1","pages":"596-603"},"PeriodicalIF":4.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MIXVPR++: Enhanced Visual Place Recognition With Hierarchical-Region Feature-Mixer and Adaptive Gabor Texture Fuser","authors":"Jiwei Nie;Dingyu Xue;Feng Pan;Shuai Cheng;Wei Liu;Jun Hu;Zuotao Ning","doi":"10.1109/LRA.2024.3511416","DOIUrl":"https://doi.org/10.1109/LRA.2024.3511416","url":null,"abstract":"Visual Place Recognition (VPR) is crucial for various computer vision and robotics applications. Traditional VPR techniques relying on handcrafted features, have been enhanced by using Convolutional Neural Networks (CNNs). Recently, MixVPR has set new benchmarks in VPR by using advanced feature aggregation techniques. However, MixVPR's full-image feature mixing approach can lead to the ignoring of critical local detail information and regional saliency information in large-scale images. To overcome this, we propose MIXVPR++, which integrates an Adaptive Gabor Texture Fuser with a Learnable Gabor Filter for enriching semantic context with texture details information and a Hierarchical-Region Feature-Mixer for better spatial hierarchy capture regional saliency information, thereby enhancing robustness and accuracy. Extensive experiments demonstrate that MIXVPR++ outperforms state-of-the-art methods across most challenging benchmarks. Despite its impressive performance, MIXVPR++ shows limitations in handling severe viewpoint changes, indicating an area for future improvement.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 1","pages":"580-587"},"PeriodicalIF":4.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142821243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Motion Prior for Accurate Pose Estimation of Dashboard Cameras","authors":"Yipeng Lu;Yifan Zhao;Haiping Wang;Zhiwei Ruan;Yuan Liu;Zhen Dong;Bisheng Yang","doi":"10.1109/LRA.2024.3511381","DOIUrl":"https://doi.org/10.1109/LRA.2024.3511381","url":null,"abstract":"Dashboard cameras (dashcams) record millions of driving videos daily, offering a valuable potential data source for various applications, including driving map production and updates. A necessary step for utilizing these dashcam data involves the estimation of camera poses. However, the low-quality images captured by dashcams, characterized by motion blurs and dynamic objects, pose challenges for existing image-matching methods in accurately estimating camera poses. In this study, we propose a precise pose estimation method for dashcam images, leveraging the inherent camera motion prior. Typically, image sequences captured by dash cameras exhibit pronounced motion prior, such as forward movement or lateral turns, which serve as essential cues for correspondence estimation. Building upon this observation, we devise a pose regression module aimed at learning camera motion prior, subsequently integrating these prior into both correspondences and pose estimation processes. The experiment shows that, in real dashcams dataset, our method is 22% better than the baseline for pose estimation in AUC5°, and it can estimate poses for 19% more images with less reprojection error in Structure from Motion (SfM).","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 1","pages":"764-771"},"PeriodicalIF":4.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142821192","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decentralized Cooperative Localization: A Communication-Efficient Dual-Fusion Consistent Approach","authors":"Ning Hao;Fenghua He;Yi Hou;Wanpeng Song;Dong Xu;Yu Yao","doi":"10.1109/LRA.2024.3511413","DOIUrl":"https://doi.org/10.1109/LRA.2024.3511413","url":null,"abstract":"Decentralized cooperative localization poses significant challenges in managing inter-robot correlations, especially in environments with limited communication capacity and unreliable network connectivity. In this letter, we propose a communication-efficient decentralized consistent cooperative localization approach with almost minimal requirements for storage, communication, and network connectivity. A dual-fusion framework that integrates heterogeneous and homogeneous fusion is presented. In this framework, each robot only tracks its own local state and exchanges local estimates with its neighboring robots that possess relative measurements. In the heterogeneous fusion stage, we present an MAP-based decentralized fusion approach to fuse prior estimates of multiple heterogeneous states received from neighboring observed robots and nonlinear measurements in the presence of unknown cross-correlations. In the homogeneous fusion stage, the estimates from neighboring observing robots are further fused based on the CI technique, fully exploiting all available information and thus yielding better estimation results. The proposed algorithm is proved to be consistent. Extensive Monte Carlo simulations and real-world experiments demonstrate that our approach outperforms state-of-the-art methods.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 1","pages":"636-643"},"PeriodicalIF":4.6,"publicationDate":"2024-12-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142810585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}