LaRa: Latents and Rays for Multi-Camera Bird's-Eye-View Semantic Segmentation

Florent Bartoccioni, Éloi Zablocki, Andrei Bursuc, Patrick Pérez, Matthieu Cord, Karteek Alahari
{"title":"LaRa: Latents and Rays for Multi-Camera Bird's-Eye-View Semantic Segmentation","authors":"Florent Bartoccioni, 'Eloi Zablocki, Andrei Bursuc, Patrick P'erez, M. Cord, Alahari Karteek","doi":"10.48550/arXiv.2206.13294","DOIUrl":null,"url":null,"abstract":"Recent works in autonomous driving have widely adopted the bird's-eye-view (BEV) semantic map as an intermediate representation of the world. Online prediction of these BEV maps involves non-trivial operations such as multi-camera data extraction as well as fusion and projection into a common topview grid. This is usually done with error-prone geometric operations (e.g., homography or back-projection from monocular depth estimation) or expensive direct dense mapping between image pixels and pixels in BEV (e.g., with MLP or attention). In this work, we present 'LaRa', an efficient encoder-decoder, transformer-based model for vehicle semantic segmentation from multiple cameras. Our approach uses a system of cross-attention to aggregate information over multiple sensors into a compact, yet rich, collection of latent representations. These latent representations, after being processed by a series of self-attention blocks, are then reprojected with a second cross-attention in the BEV space. We demonstrate that our model outperforms the best previous works using transformers on nuScenes. The code and trained models are available at https://github.com/valeoai/LaRa","PeriodicalId":273870,"journal":{"name":"Conference on Robot Learning","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Conference on Robot Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2206.13294","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 12

Abstract

Recent works in autonomous driving have widely adopted the bird's-eye-view (BEV) semantic map as an intermediate representation of the world. Online prediction of these BEV maps involves non-trivial operations such as multi-camera data extraction as well as fusion and projection into a common top-view grid. This is usually done with error-prone geometric operations (e.g., homography or back-projection from monocular depth estimation) or expensive direct dense mapping between image pixels and pixels in BEV (e.g., with MLP or attention). In this work, we present 'LaRa', an efficient encoder-decoder, transformer-based model for vehicle semantic segmentation from multiple cameras. Our approach uses a system of cross-attention to aggregate information over multiple sensors into a compact, yet rich, collection of latent representations. These latent representations, after being processed by a series of self-attention blocks, are then reprojected with a second cross-attention in the BEV space. We demonstrate that our model outperforms the best previous works using transformers on nuScenes. The code and trained models are available at https://github.com/valeoai/LaRa.
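The abstract outlines a Perceiver-style encode/process/decode pipeline: per-camera features, tagged with ray (geometry) embeddings, are compressed into a fixed set of latents by cross-attention, refined by self-attention blocks, and read out onto the BEV grid by a second cross-attention. The sketch below illustrates that flow in PyTorch; all module choices, dimensions, and the ray-embedding interface are illustrative assumptions rather than the authors' implementation (see the linked repository for the real code).

```python
# Minimal sketch of a LaRa-style latents-and-rays pipeline (assumed, not official).
import torch
import torch.nn as nn

class LaRaSketch(nn.Module):
    def __init__(self, feat_dim=256, num_latents=512, num_self_blocks=4,
                 bev_size=200, num_classes=1):
        super().__init__()
        # Compact collection of learned latent vectors shared across cameras.
        self.latents = nn.Parameter(torch.randn(num_latents, feat_dim))
        # First cross-attention: latents (queries) aggregate multi-camera tokens.
        self.encode_xattn = nn.MultiheadAttention(feat_dim, 8, batch_first=True)
        # Self-attention blocks refine the latent collection.
        self.self_blocks = nn.ModuleList([
            nn.TransformerEncoderLayer(feat_dim, 8, batch_first=True)
            for _ in range(num_self_blocks)
        ])
        # Learned BEV queries, one per output grid cell.
        self.bev_queries = nn.Parameter(torch.randn(bev_size * bev_size, feat_dim))
        # Second cross-attention: BEV queries read from the latents.
        self.decode_xattn = nn.MultiheadAttention(feat_dim, 8, batch_first=True)
        self.head = nn.Linear(feat_dim, num_classes)
        self.bev_size = bev_size

    def forward(self, cam_feats, ray_embed):
        # cam_feats: (B, N_cams * H * W, feat_dim) flattened backbone features.
        # ray_embed: same shape; encodes each pixel's viewing ray from the
        # camera intrinsics/extrinsics (the "rays" of the title).
        b = cam_feats.size(0)
        tokens = cam_feats + ray_embed                    # inject geometry
        lat = self.latents.unsqueeze(0).expand(b, -1, -1)
        lat, _ = self.encode_xattn(lat, tokens, tokens)   # sensors -> latents
        for blk in self.self_blocks:
            lat = blk(lat)                                # refine latents
        q = self.bev_queries.unsqueeze(0).expand(b, -1, -1)
        bev, _ = self.decode_xattn(q, lat, lat)           # latents -> BEV grid
        logits = self.head(bev)                           # per-cell class logits
        return logits.transpose(1, 2).reshape(b, -1, self.bev_size, self.bev_size)
```

In this reading, the cost of attention scales with the fixed latent count rather than with a dense pixel-to-BEV mapping, which is the efficiency argument the abstract makes. A forward pass would take, e.g., features from the six nuScenes cameras flattened to (B, 6*H*W, 256) with a matching ray embedding (shapes here are assumptions for illustration).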