{"title":"CMGFA:基于跨模态混合组注意特征聚合器的 BEV 细分模型","authors":"Xinkai Kuang;Runxin Niu;Chen Hua;Chunmao Jiang;Hui Zhu;Ziyu Chen;Biao Yu","doi":"10.1109/LRA.2024.3495376","DOIUrl":null,"url":null,"abstract":"Bird's eye view (BEV) segmentation map is a recent development in autonomous driving that provides effective environmental information, such as drivable areas and lane dividers. Most of the existing methods use cameras and LiDAR as inputs for segmentation and the fusion of different modalities is accomplished through either concatenation or addition operations, which fails to exploit fully the correlation and complementarity between modalities. This letter presents the CMGFA (Cross-Modal Group-mix attention Feature Aggregator), an end-to-end learning framework that can adapt to multiple modal feature combinations for BEV segmentation. The CMGFA comprises the following components: i) The camera has a dual-branch structure that strengthens the linkage between local and global features. ii) Multi-head deformable cross-attention is applied as cross-modal feature aggregators to aggregate camera, LiDAR, and Radar feature maps in BEV for implicit fusion. iii) The Group-Mix attention is used to enrich the attention map feature space and enhance the ability to segment between different categories. 
We evaluate our proposed method on the nuScenes and Argoverse2 datasets, where the CMGFA significantly outperforms the baseline.","PeriodicalId":13241,"journal":{"name":"IEEE Robotics and Automation Letters","volume":"9 12","pages":"11497-11504"},"PeriodicalIF":4.6000,"publicationDate":"2024-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"CMGFA: A BEV Segmentation Model Based on Cross-Modal Group-Mix Attention Feature Aggregator\",\"authors\":\"Xinkai Kuang;Runxin Niu;Chen Hua;Chunmao Jiang;Hui Zhu;Ziyu Chen;Biao Yu\",\"doi\":\"10.1109/LRA.2024.3495376\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Bird's eye view (BEV) segmentation map is a recent development in autonomous driving that provides effective environmental information, such as drivable areas and lane dividers. Most of the existing methods use cameras and LiDAR as inputs for segmentation and the fusion of different modalities is accomplished through either concatenation or addition operations, which fails to exploit fully the correlation and complementarity between modalities. This letter presents the CMGFA (Cross-Modal Group-mix attention Feature Aggregator), an end-to-end learning framework that can adapt to multiple modal feature combinations for BEV segmentation. The CMGFA comprises the following components: i) The camera has a dual-branch structure that strengthens the linkage between local and global features. ii) Multi-head deformable cross-attention is applied as cross-modal feature aggregators to aggregate camera, LiDAR, and Radar feature maps in BEV for implicit fusion. iii) The Group-Mix attention is used to enrich the attention map feature space and enhance the ability to segment between different categories. 
We evaluate our proposed method on the nuScenes and Argoverse2 datasets, where the CMGFA significantly outperforms the baseline.\",\"PeriodicalId\":13241,\"journal\":{\"name\":\"IEEE Robotics and Automation Letters\",\"volume\":\"9 12\",\"pages\":\"11497-11504\"},\"PeriodicalIF\":4.6000,\"publicationDate\":\"2024-11-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Robotics and Automation Letters\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10749835/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ROBOTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Robotics and Automation Letters","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10749835/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ROBOTICS","Score":null,"Total":0}
CMGFA: A BEV Segmentation Model Based on Cross-Modal Group-Mix Attention Feature Aggregator
The bird's-eye-view (BEV) segmentation map is a recent development in autonomous driving that provides effective environmental information, such as drivable areas and lane dividers. Most existing methods use cameras and LiDAR as inputs for segmentation, and the fusion of the different modalities is accomplished through either concatenation or addition operations, which fails to fully exploit the correlation and complementarity between modalities. This letter presents CMGFA (Cross-Modal Group-Mix Attention Feature Aggregator), an end-to-end learning framework that can adapt to multiple combinations of modal features for BEV segmentation. CMGFA comprises the following components: i) the camera branch has a dual-branch structure that strengthens the linkage between local and global features; ii) multi-head deformable cross-attention is applied as a cross-modal feature aggregator to fuse camera, LiDAR, and radar feature maps in BEV for implicit fusion; iii) Group-Mix attention is used to enrich the attention-map feature space and enhance the ability to segment between different categories. We evaluate the proposed method on the nuScenes and Argoverse 2 datasets, where CMGFA significantly outperforms the baseline.
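To make the cross-modal aggregation idea concrete, the sketch below shows a plain multi-head cross-attention fusion of two BEV feature maps in NumPy, with camera cells as queries and LiDAR cells as keys/values. This is a simplified stand-in, not the authors' implementation: the paper uses *deformable* cross-attention (sparse, learned sampling locations), whereas this sketch attends densely over all cells, and the random projection matrices stand in for learned weights. All function and variable names here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention_fuse(query_feats, kv_feats, num_heads=4, seed=0):
    """Fuse two BEV feature maps with dense multi-head cross-attention.

    query_feats : (N, C) flattened camera BEV cells (queries)
    kv_feats    : (N, C) flattened LiDAR BEV cells (keys/values)
    Returns fused features of shape (N, C).
    """
    n, c = query_feats.shape
    assert c % num_heads == 0
    d = c // num_heads
    rng = np.random.default_rng(seed)
    # Random projections stand in for the learned Q/K/V weight matrices.
    wq, wk, wv = (rng.standard_normal((c, c)) / np.sqrt(c) for _ in range(3))
    q = (query_feats @ wq).reshape(n, num_heads, d).transpose(1, 0, 2)
    k = (kv_feats @ wk).reshape(n, num_heads, d).transpose(1, 0, 2)
    v = (kv_feats @ wv).reshape(n, num_heads, d).transpose(1, 0, 2)
    # Scaled dot-product attention per head: (H, N, N) weights over LiDAR cells.
    attn = softmax(q @ k.transpose(0, 2, 1) / np.sqrt(d), axis=-1)
    out = (attn @ v).transpose(1, 0, 2).reshape(n, c)
    # Residual connection keeps the camera signal when LiDAR adds little.
    return query_feats + out

cam = np.random.default_rng(1).standard_normal((16, 8))    # 16 BEV cells, 8 channels
lidar = np.random.default_rng(2).standard_normal((16, 8))
fused = cross_attention_fuse(cam, lidar, num_heads=2)
print(fused.shape)  # (16, 8)
```

Deformable attention replaces the dense (N, N) weight map with a handful of sampled reference points per query, which is what makes attending over large BEV grids tractable; the dense version above is only meant to show where the camera and LiDAR features interact.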
About the journal:
The scope of this journal is to publish peer-reviewed articles that provide a timely and concise account of innovative research ideas and application results, reporting significant theoretical findings and application case studies in areas of robotics and automation.