Title: Temporal Scene-Object Graph Learning for Object Navigation
Authors: Lu Chen; Zongtao He; Liuyi Wang; Chengju Liu; Qijun Chen
DOI: 10.1109/LRA.2025.3553055
IEEE Robotics and Automation Letters, vol. 10, no. 5, pp. 4914-4921, published 19 March 2025.
Abstract: Object navigation tasks require agents to locate target objects within unfamiliar indoor environments. However, the first-person perspective inherently imposes limited visibility, complicating global planning. It is therefore imperative for the agent to cultivate an efficient visual representation from this restricted viewpoint. To address this, we introduce a temporal scene-object graph (TSOG) to construct an informative and efficient egocentric visual representation. Firstly, we develop a holistic object feature descriptor (HOFD) to fully describe object features from different aspects, facilitating the learning of relationships between observed and unseen objects. Next, we propose a scene-object graph (SOG) to simultaneously learn local and global correlations between objects and agent observations, granting the agent a more comprehensive and flexible scene understanding ability. This allows the agent to perform target association and search more efficiently. Finally, we introduce a temporal graph aggregation (TGA) module to dynamically aggregate memory information across consecutive time steps. TGA offers the agent a dynamic perspective on historical steps, aiding navigation towards the target over longer trajectories. Extensive experiments on the AI2THOR and Gibson datasets demonstrate our method's effectiveness and efficiency for ObjectNav tasks in unseen environments.

Title: MUSE: A Real-Time Multi-Sensor State Estimator for Quadruped Robots
Authors: Ylenia Nisticò; João Carlos Virgolino Soares; Lorenzo Amatucci; Geoff Fink; Claudio Semini
DOI: 10.1109/LRA.2025.3553047
IEEE Robotics and Automation Letters, vol. 10, no. 5, pp. 4620-4627, published 19 March 2025.
Abstract: This letter introduces an innovative state estimator, MUSE (MUlti-sensor State Estimator), designed to enhance the accuracy and real-time performance of state estimation for quadruped robot navigation. The proposed estimator builds upon our previous work (Fink et al., 2020). It integrates data from a range of onboard sensors, including IMUs, encoders, cameras, and LiDARs, to deliver a comprehensive and reliable estimate of the robot's pose and motion, even in slippery scenarios. We tested MUSE on a Unitree Aliengo robot, successfully closing the locomotion control loop in difficult scenarios, including slippery and uneven terrain. Benchmarking against Pronto (Camurri et al., 2020) and VILENS (Wisth et al., 2022) showed 67.6% and 26.7% reductions in translational errors, respectively. Additionally, MUSE outperformed DLIO (Chen et al., 2023), a LiDAR-inertial odometry system, in rotational error and update frequency, while the proprioceptive version of MUSE (P-MUSE) outperformed TSIF (Bloesch et al., 2018), with a 45.9% reduction in absolute trajectory error (ATE).

{"title":"First-Person View Interfaces for Teleoperation of Aerial Swarms","authors":"Benjamin Jarvis;Charbel Toumieh;Dario Floreano","doi":"10.1109/LRA.2025.3553062","DOIUrl":"https://doi.org/10.1109/LRA.2025.3553062","url":null,"abstract":"Aerial swarms can substantially improve the effectiveness of drones in applications such as inspection, monitoring, and search for rescue. This is especially true when those swarms are made of several individual drones that use local sensing and coordination rules to achieve collective motion. Despite recent progress in swarm autonomy, human control and decision-making are still critical for missions where lives are at risk or human cognitive skills are required. However, first-person-view (FPV) teleoperation systems require one or more human operators per drone, limiting the scalability of these systems to swarms. This work investigates the performance, preference, and behaviour of pilots using different FPV interfaces for teleoperation of aerial swarms. Interfaces with single and multiple perspectives were experimentally studied with humans piloting a simulated aerial swarm through an obstacle course. Participants were found to prefer and perform better with views from the back of the swarm, while views from the front caused users to fly faster but resulted in more crashes. Presenting users with multiple views at once resulted in a slower completion time, and users were found to focus on the largest view, regardless of its perspective within the swarm.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 5","pages":"4476-4483"},"PeriodicalIF":4.6,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143716546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: HeR-DRL: Heterogeneous Relational Deep Reinforcement Learning for Single-Robot and Multi-Robot Crowd Navigation
Authors: Xinyu Zhou; Songhao Piao; Wenzheng Chi; Liguo Chen; Wei Li
DOI: 10.1109/LRA.2025.3553050
IEEE Robotics and Automation Letters, vol. 10, no. 5, pp. 4524-4531, published 19 March 2025.
Abstract: Crowd navigation has garnered significant research attention in recent years, particularly with the advent of DRL-based methods. Current DRL-based methods have extensively explored interaction relationships in single-robot scenarios. However, the heterogeneity of multiple interaction relationships is often disregarded. This "interaction blind spot" hinders progress towards more complex scenarios, such as multi-robot crowd navigation. In this letter, we propose a heterogeneous relational deep reinforcement learning method, named HeR-DRL, which utilizes a customized heterogeneous graph neural network (GNN) to enhance overall performance in crowd navigation. Firstly, we devise a method for constructing a robot-crowd heterogeneous relation graph that effectively models the heterogeneous pairwise interaction relationships. Based on this graph, we propose a novel heterogeneous GNN to encode interaction information. Finally, we incorporate the encoded information into deep reinforcement learning to learn the optimal policy. HeR-DRL is rigorously evaluated against state-of-the-art algorithms in both single-robot and multi-robot circle-crossing scenarios. The experimental results demonstrate that HeR-DRL surpasses the state-of-the-art approaches in overall performance, particularly excelling in efficiency and comfort. This underscores the significance of heterogeneous interactions in crowd navigation.

{"title":"Robust LiDAR-Camera Calibration With 2D Gaussian Splatting","authors":"Shuyi Zhou;Shuxiang Xie;Ryoichi Ishikawa;Takeshi Oishi","doi":"10.1109/LRA.2025.3552955","DOIUrl":"https://doi.org/10.1109/LRA.2025.3552955","url":null,"abstract":"LiDAR-camera systems have become increasingly popular in robotics recently. A critical and initial step in integrating the LiDAR and camera data is the calibration of the LiDAR-camera system. Most existing calibration methods rely on auxiliary target objects, which often involve complex manual operations, whereas targetless methods have yet to achieve practical effectiveness. Recognizing that 2D Gaussian Splatting (2DGS) can reconstruct geometric information from camera image sequences, we propose a calibration method that estimates LiDAR-camera extrinsic parameters using geometric constraints. The proposed method begins by reconstructing colorless 2DGS using LiDAR point clouds. Subsequently, we update the colors of the Gaussian splats by minimizing the photometric loss. The extrinsic parameters are optimized during this process. Additionally, we address the limitations of the photometric loss by incorporating the reprojection and triangulation losses, thereby enhancing the calibration robustness and accuracy.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 5","pages":"4674-4681"},"PeriodicalIF":4.6,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10933576","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143777853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: RGBDS-SLAM: A RGB-D Semantic Dense SLAM Based on 3D Multi Level Pyramid Gaussian Splatting
Authors: Zhenzhong Cao; Chenyang Zhao; Qianyi Zhang; Jinzheng Guang; Yinuo Song; Jingtai Liu
DOI: 10.1109/LRA.2025.3553049
IEEE Robotics and Automation Letters, vol. 10, no. 5, pp. 4778-4785, published 19 March 2025.
Abstract: High-fidelity reconstruction is crucial for dense SLAM. Recent popular methods utilize 3D Gaussian splatting (3D GS) techniques for the RGB, depth, and semantic reconstruction of scenes. However, these methods ignore issues of detail and consistency across different parts of the scene. To address this, we propose RGBDS-SLAM, an RGB-D semantic dense SLAM system based on 3D multi-level pyramid Gaussian splatting, which enables high-fidelity dense reconstruction of scene RGB, depth, and semantics. In this system, we introduce a 3D multi-level pyramid Gaussian splatting method that restores scene details by extracting multi-level image pyramids for Gaussian splatting training, ensuring consistency in RGB, depth, and semantic reconstructions. Additionally, we design a tightly coupled multi-feature reconstruction optimization mechanism, allowing the reconstruction accuracies of RGB, depth, and semantic features to mutually enhance each other during rendering optimization. Extensive quantitative, qualitative, and ablation experiments on the Replica and ScanNet public datasets demonstrate that our method outperforms current state-of-the-art methods, achieving improvements of 11.13% in PSNR and 68.57% in LPIPS.

{"title":"Pose Estimation of Magnetically Driven Helical Robots With Eye-in-Hand Magnetic Sensing","authors":"Yong Zeng;Guangyu Chen;Haoxiang Lian;Kun Bai","doi":"10.1109/LRA.2025.3553048","DOIUrl":"https://doi.org/10.1109/LRA.2025.3553048","url":null,"abstract":"This letter presents a magnetic-based pose sensing method for magnetically driven helical robots. Unlike conventional methods that directly compute pose from magnetic field measurements, the proposed approach decouples magnetic field components caused by the helical robot's pose from the rotating magnetic field by deriving the analytic relationship between the spatial characteristics of the rotating magnetic field and the rotating permanent magnet (PM). A magnetic field model for a dual-rotating PM system is established under quasi-static driving conditions, enabling real-time pose estimation by taking account into the effects of the driving PM. To address workspace and signal quality limitations, a mobile sensor array in eye-in-hand configuration is presented, achieving follow-up measurements with improved signal-to-noise ratio and high precision. The proposed method has been validated experimentally on a magnetically driving platform and the results demonstrate that this method enables large-range tracking with limited number of sensors and provides a robust solution for continuous real-time pose sensing for in magnetically driven helical robots.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 5","pages":"4604-4611"},"PeriodicalIF":4.6,"publicationDate":"2025-03-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143777749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: MuSCLe-Reg: Multi-Scale Contextual Embedding and Local Correspondence Rectification for Robust Two-Stage Point Cloud Registration
Authors: Yangyang Zhang; Jialong Zhang; Xiaolong Qian; Yi Cen; Bowen Zhang; Jun Gong
DOI: 10.1109/LRA.2025.3551954
IEEE Robotics and Automation Letters, vol. 10, no. 5, pp. 4754-4761, published 18 March 2025.
Abstract: Outlier removal for learning-based 3D point cloud registration is usually formulated as a classification problem, whose success hinges on learning discriminative inlier/outlier feature representations. This letter proposes an efficient two-stage network (MuSCLe-Reg) with multi-scale local feature fusion embedding. Specifically, we design a two-stage registration architecture. Firstly, we construct a graph topology feature consisting of correspondences and their feature neighborhoods. Then, the feature representation of the correspondences is enhanced through multi-scale feature mapping and fusion (MSF). In addition, we propose a local correspondence rectification (LCR) strategy based on feature neighbors to evaluate initial candidates and generate higher-quality correspondences. Experimental results on various datasets show that, compared with existing learning-based algorithms, the network achieves better accuracy and stronger generalization. In particular, in robustness tests across varying numbers of correspondences on the 3DLoMatch dataset, the algorithm demonstrates superior estimation performance compared to current state-of-the-art registration techniques.

{"title":"Information-Theoretic Detection of Bimanual Interactions for Dual-Arm Robot Plan Generation","authors":"Elena Merlo;Marta Lagomarsino;Arash Ajoudani","doi":"10.1109/LRA.2025.3552216","DOIUrl":"https://doi.org/10.1109/LRA.2025.3552216","url":null,"abstract":"Programming by demonstration is a strategy to simplify the robot programming process for non-experts via human demonstrations. However, its adoption for bimanual tasks is an underexplored problem due to the complexity of hand coordination, which also hinders data recording. This letter presents a novel one-shot method for processing a single RGB video of a bimanual task demonstration to generate an execution plan for a dual-arm robotic system. To detect hand coordination policies, we apply Shannon's information theory to analyze the information flow between scene elements and leverage scene graph properties. The generated plan is a modular behavior tree that assumes different structures based on the desired arms coordination. We validated the effectiveness of this framework through multiple subject video demonstrations, which we collected and made open-source, and exploiting data from an external, publicly available dataset. Comparisons with existing methods revealed significant improvements in generating a centralized execution plan for coordinating two-arm systems.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 5","pages":"4532-4539"},"PeriodicalIF":4.6,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143726442","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"3D Single Object Tracking With Cross-Modal Fusion Conflict Elimination","authors":"Yushi Yang;Wei Li;Ying Yao;Bo Zhou;Baojie Fan","doi":"10.1109/LRA.2025.3551951","DOIUrl":"https://doi.org/10.1109/LRA.2025.3551951","url":null,"abstract":"3D single object tracking based on point clouds is a key challenge in robotics and autonomous driving technology. Mainstream methods rely on point clouds for geometric matching or motion estimation between the target template and the search area. However, the lack of texture and the sparsity of incomplete point clouds make it difficult for unimodal trackers to distinguish objects with similar structures. To overcome the limitations of previous methods, this letter proposes a cross-modal fusion conflict elimination tracker (CCETrack). The point clouds collected by LiDAR provide accurate depth and shape information about the surrounding environment, while the camera sensor provides RGB images containing rich semantic and texture information. CCETrack fully leverages both modalities to track 3D objects. Specifically, to address cross-modal conflicts caused by heterogeneous sensors, we propose a global context alignment module that aligns RGB images with point clouds and generates enhanced image features. Then, a sparse feature enhancement module is designed to optimize voxelized point cloud features using the rich image features. In the feature fusion stage, both modalities are converted into BEV features, with the template and search area features fused separately. A self-attention mechanism is employed to establish bidirectional communication between regions. Our method maximizes the use of effective information and achieves state-of-the-art performance on the KITTI and nuScenes datasets through multimodal complementarity.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"10 5","pages":"4826-4833"},"PeriodicalIF":4.6,"publicationDate":"2025-03-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143792802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}