{"title":"Swin-MGNet: Swin Transformer Based Multiview Grouping Network for 3-D Object Recognition","authors":"Xin Ning;Limin Jiang;Weijun Li;Zaiyang Yu;Jinlong Xie;Lusi Li;Prayag Tiwari;Fernando Alonso-Fernandez","doi":"10.1109/TAI.2024.3492163","DOIUrl":null,"url":null,"abstract":"Recent developments in Swin Transformer have shown its great potential in various computer vision tasks, including image classification, semantic segmentation, and object detection. However, it is challenging to achieve desired performance by directly employing the Swin Transformer in multiview 3-D object recognition since the Swin Transformer independently extracts the characteristics of each view and relies heavily on a subsequent fusion strategy to unify the multiview information. This leads to the problem of the insufficient extraction of interdependencies between the multiview images. To this end, we propose an aggregation strategy integrated into the Swin Transformer to reinforce the connections between internal features across multiple views, thus leading to a complete interpretation of isolated features extracted by the Swin Transformer. Specifically, we utilize Swin Transformer to learn view-level feature representations from multiview images and then calculate their view discrimination scores. The scores are employed to assign the view-level features to different groups. Finally, a grouping and fusion network is proposed to aggregate the features from view and group levels. The experimental results indicate that our method attains state-of-the-art performance compared with prior approaches in multiview 3-D object recognition tasks. The source code is available at <uri>https://github.com/Qishaohua94/DEST</uri>.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"6 3","pages":"747-758"},"PeriodicalIF":0.0000,"publicationDate":"2024-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE transactions on artificial intelligence","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/10745246/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
Recent developments in Swin Transformer have shown its great potential in various computer vision tasks, including image classification, semantic segmentation, and object detection. However, it is challenging to achieve desired performance by directly employing the Swin Transformer in multiview 3-D object recognition since the Swin Transformer independently extracts the characteristics of each view and relies heavily on a subsequent fusion strategy to unify the multiview information. This leads to the problem of the insufficient extraction of interdependencies between the multiview images. To this end, we propose an aggregation strategy integrated into the Swin Transformer to reinforce the connections between internal features across multiple views, thus leading to a complete interpretation of isolated features extracted by the Swin Transformer. Specifically, we utilize Swin Transformer to learn view-level feature representations from multiview images and then calculate their view discrimination scores. The scores are employed to assign the view-level features to different groups. Finally, a grouping and fusion network is proposed to aggregate the features from view and group levels. The experimental results indicate that our method attains state-of-the-art performance compared with prior approaches in multiview 3-D object recognition tasks. The source code is available at https://github.com/Qishaohua94/DEST.