{"title":"A Novel SO(3) Rotational Equivariant Masked Autoencoder for 3D Mesh Object Analysis","authors":"Min Xie;Jieyu Zhao;Kedi Shen","doi":"10.1109/TCSVT.2024.3465041","DOIUrl":null,"url":null,"abstract":"Equivariant networks have recently made significant strides in computer vision tasks related to robotic grasping, molecule generation, and 6D pose tracking. In this paper, we explore 3D mesh object analysis based on an equivariant masked autoencoder to reduce the model dependence on large datasets and predict the pose transformation. We employ 3D reconstruction tasks under rotation and masking operations, such as segmentation tasks after rotation, as pretraining to enhance downstream task performance. To mitigate the computational complexity of the algorithm, we first utilize multiple non-overlapping 3D mesh patches with a fixed face size. We then design a rotation-equivariant self-attention mechanism to obtain advanced features. To improve the throughput of the encoder, we design a sparse token merging strategy. Our method achieves comparable performance on equivariant analysis tasks of mesh objects, such as 3D mesh pose transformation estimation, object classification and part segmentation on the ShapeNetCore16, Manifold40, COSEG-aliens, COSEG-vases and Human Body datasets. In the object classification task, we achieve superior performance even when only 10% of the original sample is used. We perform extensive ablation experiments to demonstrate the efficacy of critical design choices in our approach.","PeriodicalId":13082,"journal":{"name":"IEEE Transactions on Circuits and Systems for Video Technology","volume":"35 1","pages":"329-342"},"PeriodicalIF":8.3000,"publicationDate":"2024-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Circuits and Systems for Video Technology","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10684728/","RegionNum":1,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
Abstract
Equivariant networks have recently made significant strides in computer vision tasks related to robotic grasping, molecule generation, and 6D pose tracking. In this paper, we explore 3D mesh object analysis based on an equivariant masked autoencoder to reduce the model's dependence on large datasets and to predict the pose transformation. We employ 3D reconstruction tasks under rotation and masking operations, such as segmentation after rotation, as pretraining to enhance downstream task performance. To mitigate the computational complexity of the algorithm, we first utilize multiple non-overlapping 3D mesh patches with a fixed face size. We then design a rotation-equivariant self-attention mechanism to obtain advanced features. To improve the throughput of the encoder, we design a sparse token merging strategy. Our method achieves comparable performance on equivariant analysis tasks for mesh objects, such as 3D mesh pose transformation estimation, object classification, and part segmentation, on the ShapeNetCore16, Manifold40, COSEG-aliens, COSEG-vases, and Human Body datasets. In the object classification task, we achieve superior performance even when only 10% of the original training samples are used. We perform extensive ablation experiments to demonstrate the efficacy of critical design choices in our approach.
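The abstract outlines two ingredients of the approach: random masking of non-overlapping mesh patches for autoencoder pretraining, and a self-attention layer whose output rotates with the input. The sketch below is not the authors' implementation; it is a minimal illustration, under assumed toy settings, of why attention weights built from rotation-invariant inner products of 3D token features yield an SO(3)-equivariant layer, together with a simple patch-masking step. The patch features, masking ratio, and function names are illustrative assumptions.

```python
# Minimal sketch (assumptions, not the paper's code): rotation-equivariant
# self-attention on 3D token features plus random patch masking.
import numpy as np

def random_rotation(rng):
    """Sample a rotation matrix R in SO(3) via QR decomposition of a Gaussian matrix."""
    q, r = np.linalg.qr(rng.normal(size=(3, 3)))
    q *= np.sign(np.diag(r))          # fix the sign ambiguity of the factorization
    if np.linalg.det(q) < 0:          # ensure det(R) = +1
        q[:, 0] *= -1
    return q

def mask_patches(tokens, mask_ratio, rng):
    """Randomly drop a fraction of patch tokens, as in masked-autoencoder pretraining."""
    n = tokens.shape[0]
    keep = rng.permutation(n)[: int(n * (1 - mask_ratio))]
    return tokens[np.sort(keep)]

def rotation_equivariant_attention(x):
    """Self-attention whose weights depend only on rotation-invariant dot products."""
    gram = x @ x.T / np.sqrt(x.shape[1])          # invariant under x -> x @ R.T
    attn = np.exp(gram - gram.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)       # row-wise softmax
    return attn @ x                               # 3D values -> output rotates with input

rng = np.random.default_rng(0)
patch_features = rng.normal(size=(64, 3))         # toy stand-in for mesh patch features
visible = mask_patches(patch_features, mask_ratio=0.5, rng=rng)

R = random_rotation(rng)
out_then_rotate = rotation_equivariant_attention(visible) @ R.T
rotate_then_out = rotation_equivariant_attention(visible @ R.T)
print(np.allclose(out_then_rotate, rotate_then_out))  # True: f(xR) = f(x)R
```

Because the Gram matrix x xᵀ is unchanged when every token is rotated by the same R, the attention weights are rotation-invariant and the vector-valued output is equivariant; the actual method additionally handles mesh connectivity, fixed-size face patches, and sparse token merging, which this toy example omits.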
Journal Description:
The IEEE Transactions on Circuits and Systems for Video Technology (TCSVT) is dedicated to covering all aspects of video technologies from a circuits and systems perspective. We encourage submissions of general, theoretical, and application-oriented papers related to image and video acquisition, representation, presentation, and display. Additionally, we welcome contributions in areas such as processing, filtering, and transforms; analysis and synthesis; learning and understanding; compression, transmission, communication, and networking; as well as storage, retrieval, indexing, and search. Furthermore, papers focusing on hardware and software design and implementation are highly valued. Join us in advancing the field of video technology through innovative research and insights.