Title: LMMCoDrive: Cooperative Driving with Large Multimodal Model
Authors: Haichao Liu, Ruoyu Yao, Zhenmin Huang, Shaojie Shen, Jun Ma
Source: arXiv - CS - Robotics, 2024-09-18. DOI: https://doi.org/arxiv-2409.11981
Abstract: To address the intricate challenges of decentralized cooperative scheduling and motion planning in Autonomous Mobility-on-Demand (AMoD) systems, this paper introduces LMMCoDrive, a novel cooperative driving framework that leverages a Large Multimodal Model (LMM) to enhance traffic efficiency in dynamic urban environments. The framework seamlessly integrates scheduling and motion planning to ensure the effective operation of Cooperative Autonomous Vehicles (CAVs). The spatial relationship between CAVs and passenger requests is abstracted into a Bird's-Eye View (BEV) to fully exploit the potential of the LMM. In addition, trajectories are carefully refined for each CAV while ensuring collision avoidance through safety constraints. A decentralized optimization strategy, facilitated by the Alternating Direction Method of Multipliers (ADMM) within the LMM framework, is proposed to drive the graph evolution of CAVs. Simulation results demonstrate the pivotal role and significant impact of the LMM in optimizing CAV scheduling and in enhancing the decentralized cooperative optimization process for each vehicle. This marks a substantial stride towards practical, efficient, and safe AMoD systems that are poised to revolutionize urban transportation. The code is available at https://github.com/henryhcliu/LMMCoDrive.
Title: SLAM assisted 3D tracking system for laparoscopic surgery
Authors: Jingwei Song, Ray Zhang, Wenwei Zhang, Hao Zhou, Maani Ghaffari
Source: arXiv - CS - Robotics, 2024-09-18. DOI: https://doi.org/arxiv-2409.11688
Abstract: A major limitation of minimally invasive surgery is the difficulty of accurately locating the internal anatomical structures of the target organ due to the lack of tactile feedback and transparency. Augmented reality (AR) offers a promising solution to overcome this challenge. Numerous studies have shown that combining learning-based and geometric methods can achieve accurate preoperative and intraoperative data registration. This work proposes a real-time monocular 3D tracking algorithm for post-registration tasks. The ORB-SLAM2 framework is adopted and modified for prior-based 3D tracking. The primitive 3D shape is used for fast initialization of the monocular SLAM. A pseudo-segmentation strategy is employed to separate the target organ from the background for tracking purposes, and the geometric prior of the 3D shape is incorporated as an additional constraint in the pose graph. In-vivo and ex-vivo experiments demonstrate that the proposed system provides robust 3D tracking and effectively handles typical challenges such as fast motion, out-of-field-of-view scenarios, partial visibility, and "organ-background" relative motion.
Title: Haptic-ACT: Bridging Human Intuition with Compliant Robotic Manipulation via Immersive VR
Authors: Kelin Li, Shubham M Wagh, Nitish Sharma, Saksham Bhadani, Wei Chen, Chang Liu, Petar Kormushev
Source: arXiv - CS - Robotics, 2024-09-18. DOI: https://doi.org/arxiv-2409.11925
Abstract: Robotic manipulation is essential for the widespread adoption of robots in industrial and home settings and has long been a focus within the robotics community. Advances in artificial intelligence have introduced promising learning-based methods to address this challenge, with imitation learning emerging as particularly effective. However, efficiently acquiring high-quality demonstrations remains a challenge. In this work, we introduce an immersive VR-based teleoperation setup designed to collect demonstrations from a remote human user. We also propose an imitation learning framework called Haptic Action Chunking with Transformers (Haptic-ACT). To evaluate the platform, we conducted a pick-and-place task and collected 50 demonstration episodes. Results indicate that the immersive VR platform significantly reduces demonstrator fingertip forces compared to systems without haptic feedback, enabling more delicate manipulation. Additionally, evaluations of the Haptic-ACT framework in both the MuJoCo simulator and on a real robot demonstrate its effectiveness in teaching robots more compliant manipulation compared to the original ACT. Additional materials are available at https://sites.google.com/view/hapticact.
Title: Fusion in Context: A Multimodal Approach to Affective State Recognition
Authors: Youssef Mohamed, Severin Lemaignan, Arzu Guneysu, Patric Jensfelt, Christian Smith
Source: arXiv - CS - Robotics, 2024-09-18. DOI: https://doi.org/arxiv-2409.11906
Abstract: Accurate recognition of human emotions is a crucial challenge in affective computing and human-robot interaction (HRI). Emotional states play a vital role in shaping behaviors, decisions, and social interactions. However, emotional expressions can be influenced by contextual factors, leading to misinterpretations if context is not considered. Multimodal fusion, combining modalities like facial expressions, speech, and physiological signals, has shown promise in improving affect recognition. This paper proposes a transformer-based multimodal fusion approach that leverages facial thermal data, facial action units, and textual context information for context-aware emotion recognition. We explore modality-specific encoders to learn tailored representations, which are then fused using additive fusion and processed by a shared transformer encoder to capture temporal dependencies and interactions. The proposed method is evaluated on a dataset collected from participants engaged in a tangible tabletop Pacman game designed to induce various affective states. Our results demonstrate the effectiveness of incorporating contextual information and multimodal fusion for affective state recognition.
Title: Uncertainty-Aware Visual-Inertial SLAM with Volumetric Occupancy Mapping
Authors: Jaehyung Jung, Simon Boche, Sebastian Barbas Laina, Stefan Leutenegger
Source: arXiv - CS - Robotics, 2024-09-18. DOI: https://doi.org/arxiv-2409.12051
Abstract: We propose visual-inertial simultaneous localization and mapping that tightly couples sparse reprojection errors, inertial measurement unit pre-integrals, and relative pose factors with dense volumetric occupancy mapping. Depth predictions from a deep neural network are fused in a fully probabilistic manner. Specifically, our method is rigorously uncertainty-aware: first, we use depth and uncertainty predictions from a deep network not only on the robot's stereo rig, but also probabilistically fuse motion stereo, which provides depth information across a range of baselines and thereby drastically increases mapping accuracy. Next, the predicted and fused depth uncertainty propagates not only into occupancy probabilities but also into alignment factors between the generated dense submaps that enter the probabilistic nonlinear least squares estimator. This submap representation offers globally consistent geometry at scale. Our method is thoroughly evaluated on two benchmark datasets, achieving localization and mapping accuracy that exceeds the state of the art while simultaneously offering volumetric occupancy directly usable for downstream robotic planning and control in real time.
{"title":"A machine learning framework for acoustic reflector mapping","authors":"Usama Saqib, Letizia Marchegiani, Jesper Rindom Jensen","doi":"arxiv-2409.12094","DOIUrl":"https://doi.org/arxiv-2409.12094","url":null,"abstract":"Sonar-based indoor mapping systems have been widely employed in robotics for\u0000several decades. While such systems are still the mainstream in underwater and\u0000pipe inspection settings, the vulnerability to noise reduced, over time, their\u0000general widespread usage in favour of other modalities(textit{e.g.}, cameras,\u0000lidars), whose technologies were encountering, instead, extraordinary\u0000advancements. Nevertheless, mapping physical environments using acoustic\u0000signals and echolocation can bring significant benefits to robot navigation in\u0000adverse scenarios, thanks to their complementary characteristics compared to\u0000other sensors. Cameras and lidars, indeed, struggle in harsh weather\u0000conditions, when dealing with lack of illumination, or with non-reflective\u0000walls. Yet, for acoustic sensors to be able to generate accurate maps, noise\u0000has to be properly and effectively handled. Traditional signal processing\u0000techniques are not always a solution in those cases. In this paper, we propose\u0000a framework where machine learning is exploited to aid more traditional signal\u0000processing methods to cope with background noise, by removing outliers and\u0000artefacts from the generated maps using acoustic sensors. Our goal is to\u0000demonstrate that the performance of traditional echolocation mapping techniques\u0000can be greatly enhanced, even in particularly noisy conditions, facilitating\u0000the employment of acoustic sensors in state-of-the-art multi-modal robot\u0000navigation systems. Our simulated evaluation demonstrates that the system can\u0000reliably operate at an SNR of $-10$dB. Moreover, we also show that the proposed\u0000method is capable of operating in different reverberate environments. In this\u0000paper, we also use the proposed method to map the outline of a simulated room\u0000using a robotic platform.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267030","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Discovering Conceptual Knowledge with Analytic Ontology Templates for Articulated Objects","authors":"Jianhua Sun, Yuxuan Li, Longfei Xu, Jiude Wei, Liang Chai, Cewu Lu","doi":"arxiv-2409.11702","DOIUrl":"https://doi.org/arxiv-2409.11702","url":null,"abstract":"Human cognition can leverage fundamental conceptual knowledge, like geometric\u0000and kinematic ones, to appropriately perceive, comprehend and interact with\u0000novel objects. Motivated by this finding, we aim to endow machine intelligence\u0000with an analogous capability through performing at the conceptual level, in\u0000order to understand and then interact with articulated objects, especially for\u0000those in novel categories, which is challenging due to the intricate geometric\u0000structures and diverse joint types of articulated objects. To achieve this\u0000goal, we propose Analytic Ontology Template (AOT), a parameterized and\u0000differentiable program description of generalized conceptual ontologies. A\u0000baseline approach called AOTNet driven by AOTs is designed accordingly to equip\u0000intelligent agents with these generalized concepts, and then empower the agents\u0000to effectively discover the conceptual knowledge on the structure and\u0000affordance of articulated objects. The AOT-driven approach yields benefits in\u0000three key perspectives: i) enabling concept-level understanding of articulated\u0000objects without relying on any real training data, ii) providing analytic\u0000structure information, and iii) introducing rich affordance information\u0000indicating proper ways of interaction. We conduct exhaustive experiments and\u0000the results demonstrate the superiority of our approach in understanding and\u0000then interacting with articulated objects.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":"20 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142266858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Refractive Camera Model Calibration in Visual Inertial Odometry","authors":"Mohit Singh, Kostas Alexis","doi":"arxiv-2409.12074","DOIUrl":"https://doi.org/arxiv-2409.12074","url":null,"abstract":"This paper presents a general refractive camera model and online\u0000co-estimation of odometry and the refractive index of unknown media. This\u0000enables operation in diverse and varying refractive fluids, given only the\u0000camera calibration in air. The refractive index is estimated online as a state\u0000variable of a monocular visual-inertial odometry framework in an iterative\u0000formulation using the proposed camera model. The method was verified on data\u0000collected using an underwater robot traversing inside a pool. The evaluations\u0000demonstrate convergence to the ideal refractive index for water despite\u0000significant perturbations in the initialization. Simultaneously, the approach\u0000enables on-par visual-inertial odometry performance in refractive media without\u0000prior knowledge of the refractive index or requirement of medium-specific\u0000camera calibration.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":"34 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142267087","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Title: Learning Task Planning from Multi-Modal Demonstration for Multi-Stage Contact-Rich Manipulation
Authors: Kejia Chen, Zheng Shen, Yue Zhang, Lingyun Chen, Fan Wu, Zhenshan Bing, Sami Haddadin, Alois Knoll
Source: arXiv - CS - Robotics, 2024-09-18. DOI: https://doi.org/arxiv-2409.11863
Abstract: Large Language Models (LLMs) have gained popularity in task planning for long-horizon manipulation tasks. To enhance the validity of LLM-generated plans, visual demonstrations and online videos have been widely employed to guide the planning process. However, for manipulation tasks involving subtle movements but rich contact interactions, visual perception alone may be insufficient for the LLM to fully interpret the demonstration. Additionally, visual data provides limited information on force-related parameters and conditions, which are crucial for effective execution on real robots. In this paper, we introduce an in-context learning framework that incorporates tactile and force-torque information from human demonstrations to enhance LLMs' ability to generate plans for new task scenarios. We propose a bootstrapped reasoning pipeline that sequentially integrates each modality into a comprehensive task plan. This task plan is then used as a reference for planning in new task configurations. Real-world experiments on two different sequential manipulation tasks demonstrate the effectiveness of our framework in improving LLMs' understanding of multi-modal demonstrations and enhancing overall planning performance.
{"title":"Learning-accelerated A* Search for Risk-aware Path Planning","authors":"Jun Xiang, Junfei Xie, Jun Chen","doi":"arxiv-2409.11634","DOIUrl":"https://doi.org/arxiv-2409.11634","url":null,"abstract":"Safety is a critical concern for urban flights of autonomous Unmanned Aerial\u0000Vehicles. In populated environments, risk should be accounted for to produce an\u0000effective and safe path, known as risk-aware path planning. Risk-aware path\u0000planning can be modeled as a Constrained Shortest Path (CSP) problem, aiming to\u0000identify the shortest possible route that adheres to specified safety\u0000thresholds. CSP is NP-hard and poses significant computational challenges.\u0000Although many traditional methods can solve it accurately, all of them are very\u0000slow. Our method introduces an additional safety dimension to the traditional\u0000A* (called ASD A*), enabling A* to handle CSP. Furthermore, we develop a custom\u0000learning-based heuristic using transformer-based neural networks, which\u0000significantly reduces the computational load and improves the performance of\u0000the ASD A* algorithm. The proposed method is well-validated with both random\u0000and realistic simulation scenarios.","PeriodicalId":501031,"journal":{"name":"arXiv - CS - Robotics","volume":"21 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142269797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}