{"title":"Category-Level Multi-Object 9D State Tracking Using Object-Centric Multi-Scale Transformer in Point Cloud Stream","authors":"Jingtao Sun;Yaonan Wang;Mingtao Feng;Xiaofeng Guo;Huimin Lu;Xieyuanli Chen","doi":"10.1109/TMM.2024.3521664","DOIUrl":null,"url":null,"abstract":"Category-level object pose estimation and tracking has achieved impressive progress in computer vision, augmented reality, and robotics. Existing methods either estimate the object states from a single observation or only track the 6-DoF pose of a single object. In this paper, we focus on category-level multi-object 9-Dimensional (9D) state tracking from the point cloud stream. We propose a novel 9D state estimation network to estimate the 6-DoF pose and 3D size of each instance in the scene. It uses our devised multi-scale global attention and object-level local attention modules to obtain representative latent features to estimate the 9D state of each object in the current observation. We then integrate our network estimation into a Kalman filter to combine previous states with the current estimates and achieve multi-object 9D state tracking. Experiment results on two public datasets show that our method achieves state-of-the-art performance on both category-level multi-object state estimation and pose tracking tasks. Furthermore, we directly apply the pre-trained model of our method to our air-ground robot system with multiple moving objects. Experiments on our collected real-world dataset show our method's strong generalization ability and real-time pose tracking performance.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"1072-1085"},"PeriodicalIF":8.4000,"publicationDate":"2024-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10814715/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0
Abstract
Category-level object pose estimation and tracking has achieved impressive progress in computer vision, augmented reality, and robotics. Existing methods either estimate the object states from a single observation or only track the 6-DoF pose of a single object. In this paper, we focus on category-level multi-object 9-Dimensional (9D) state tracking from the point cloud stream. We propose a novel 9D state estimation network to estimate the 6-DoF pose and 3D size of each instance in the scene. It uses our devised multi-scale global attention and object-level local attention modules to obtain representative latent features to estimate the 9D state of each object in the current observation. We then integrate our network estimation into a Kalman filter to combine previous states with the current estimates and achieve multi-object 9D state tracking. Experiment results on two public datasets show that our method achieves state-of-the-art performance on both category-level multi-object state estimation and pose tracking tasks. Furthermore, we directly apply the pre-trained model of our method to our air-ground robot system with multiple moving objects. Experiments on our collected real-world dataset show our method's strong generalization ability and real-time pose tracking performance.
期刊介绍:
The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.