DenseFuseNet: Improve 3D Semantic Segmentation in the Context of Autonomous Driving with Dense Correspondence

Yulun Wu
{"title":"DenseFuseNet:利用密集对应改进自动驾驶环境下的3D语义分割","authors":"Yulun Wu","doi":"10.1109/ICCECE51280.2021.9342077","DOIUrl":null,"url":null,"abstract":"With the development of deep convolutional networks, autonomous driving has been reforming human social activities in the recent decade. The core issue of autonomous driving is how to integrate the multi-modal perception system effectively, that is, using sensors such as lidar, RGB camera, and radar to identify general objects in traffic scenes. Extensive investigation shows that lidar and cameras are the two most powerful sensors widely used by autonomous driving companies such as Tesla and Waymo, which indeed revealed that how to integrate them effectively is bound to be one of the core issues in the field of autonomous driving in the future. Obviously, these two kinds of sensors have their inherent advantages and disadvantages. Based on the previous research works, we are motivated to fuse lidars and RGB cameras together to build a more robust perception system. It is not easy to design a model with two different domains from scratch, and a large number of previous works (e.g., FuseSeg [10]) has sufficiently proved that merging the RGB camera and lidar models can attain better results on vision tasks than the lidar model alone. However, it cannot adequately handle the inherent correspondence between the RGB camera and lidar data but rather arbitrarily interpolates between them, which quickly leads to severe distortion, heavy computational burden, and diminishing performance.To address these problems, in this paper, we proposed a general framework to establish a connection between lidar and RGB camera sensors, matching and fusing the features of the lidar and RGB models. We also defined two kinds of inaccuracies (missing pixels and covered points) in spherical projection and conducted a numerical analysis on them. Furthermore, we proposed an efficient filling algorithm to remedy the impact of missing pixels. Finally, we proposed a 3D semantic segmentation model, DenseFuseNet, which incorporated our techniques and achieved a noticeable 5.8 and 14.2 improvement in mIoU and accuracy on top of vanilla SqueezeSeg [24]. All code is already open-source on https://github.com/IDl0T/DenseFuseNet.","PeriodicalId":229425,"journal":{"name":"2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2021-01-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"DenseFuseNet: Improve 3D Semantic Segmentation in the Context of Autonomous Driving with Dense Correspondence\",\"authors\":\"Yulun Wu\",\"doi\":\"10.1109/ICCECE51280.2021.9342077\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"With the development of deep convolutional networks, autonomous driving has been reforming human social activities in the recent decade. The core issue of autonomous driving is how to integrate the multi-modal perception system effectively, that is, using sensors such as lidar, RGB camera, and radar to identify general objects in traffic scenes. Extensive investigation shows that lidar and cameras are the two most powerful sensors widely used by autonomous driving companies such as Tesla and Waymo, which indeed revealed that how to integrate them effectively is bound to be one of the core issues in the field of autonomous driving in the future. 
Obviously, these two kinds of sensors have their inherent advantages and disadvantages. Based on the previous research works, we are motivated to fuse lidars and RGB cameras together to build a more robust perception system. It is not easy to design a model with two different domains from scratch, and a large number of previous works (e.g., FuseSeg [10]) has sufficiently proved that merging the RGB camera and lidar models can attain better results on vision tasks than the lidar model alone. However, it cannot adequately handle the inherent correspondence between the RGB camera and lidar data but rather arbitrarily interpolates between them, which quickly leads to severe distortion, heavy computational burden, and diminishing performance.To address these problems, in this paper, we proposed a general framework to establish a connection between lidar and RGB camera sensors, matching and fusing the features of the lidar and RGB models. We also defined two kinds of inaccuracies (missing pixels and covered points) in spherical projection and conducted a numerical analysis on them. Furthermore, we proposed an efficient filling algorithm to remedy the impact of missing pixels. Finally, we proposed a 3D semantic segmentation model, DenseFuseNet, which incorporated our techniques and achieved a noticeable 5.8 and 14.2 improvement in mIoU and accuracy on top of vanilla SqueezeSeg [24]. All code is already open-source on https://github.com/IDl0T/DenseFuseNet.\",\"PeriodicalId\":229425,\"journal\":{\"name\":\"2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE)\",\"volume\":\"8 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2021-01-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCECE51280.2021.9342077\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2021 IEEE International Conference on Consumer Electronics and Computer Engineering (ICCECE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCECE51280.2021.9342077","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 1

Abstract

With the development of deep convolutional networks, autonomous driving has been reshaping human social activities over the past decade. A core issue in autonomous driving is how to integrate the multi-modal perception system effectively, that is, how to use sensors such as lidar, RGB cameras, and radar to identify general objects in traffic scenes. Lidar and cameras are the two most powerful sensors in wide use at autonomous driving companies such as Tesla and Waymo, so integrating them effectively is bound to remain one of the central problems in the field. These two kinds of sensors have their own inherent advantages and disadvantages, which motivates us to fuse lidar and RGB cameras to build a more robust perception system. Designing a model that spans two different domains from scratch is not easy, and a large body of prior work (e.g., FuseSeg [10]) has shown that merging RGB camera and lidar models attains better results on vision tasks than a lidar model alone. However, such approaches do not adequately handle the inherent correspondence between RGB camera and lidar data; instead, they interpolate between the two modalities somewhat arbitrarily, which leads to severe distortion, a heavy computational burden, and degraded performance. To address these problems, we propose a general framework that establishes the connection between lidar and RGB camera sensors, matching and fusing the features of the lidar and RGB models. We also define two kinds of inaccuracies in spherical projection (missing pixels and covered points) and analyze them numerically. Furthermore, we propose an efficient filling algorithm to remedy the impact of missing pixels. Finally, we present a 3D semantic segmentation model, DenseFuseNet, which incorporates these techniques and achieves noticeable improvements of 5.8 and 14.2 in mIoU and accuracy, respectively, over vanilla SqueezeSeg [24]. All code is open source at https://github.com/IDl0T/DenseFuseNet.
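The abstract refers to the spherical projection used by SqueezeSeg-style models and to two kinds of inaccuracies it introduces: missing pixels (range-image cells that receive no lidar point) and covered points (points that land in a cell already occupied by another point). The sketch below is a minimal illustration of such a projection; the image size, field-of-view values, and function name are assumptions for illustration and are not taken from the DenseFuseNet code.

```python
# Minimal sketch of a SqueezeSeg-style spherical projection of a lidar point
# cloud onto a 2D range image. H, W, fov_up, and fov_down are illustrative
# assumptions, not values from the DenseFuseNet repository.
import numpy as np

def spherical_projection(points, H=64, W=2048, fov_up=3.0, fov_down=-25.0):
    """Project an N x 3 array of lidar points (x, y, z) to an H x W range image.

    Cells that receive no point stay at -1 ("missing pixels"); when several
    points map to the same cell, later points overwrite earlier ones, so the
    overwritten points become "covered points".
    """
    fov_up_rad = np.radians(fov_up)
    fov_down_rad = np.radians(fov_down)
    fov = fov_up_rad - fov_down_rad

    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    depth = np.maximum(np.linalg.norm(points, axis=1), 1e-8)  # range per point
    yaw = np.arctan2(y, x)                                    # azimuth angle
    pitch = np.arcsin(np.clip(z / depth, -1.0, 1.0))          # elevation angle

    # Normalize the angles to [0, 1] and scale to image coordinates.
    u = 0.5 * (1.0 - yaw / np.pi) * W                         # column index
    v = (1.0 - (pitch - fov_down_rad) / fov) * H              # row index
    u = np.clip(np.floor(u), 0, W - 1).astype(np.int32)
    v = np.clip(np.floor(v), 0, H - 1).astype(np.int32)

    range_image = np.full((H, W), -1.0, dtype=np.float32)
    range_image[v, u] = depth            # collisions overwrite: covered points
    return range_image, u, v
```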
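The abstract also mentions an efficient filling algorithm that remedies the impact of missing pixels. The paper's own algorithm is not reproduced here; the sketch below only illustrates the general idea of filling empty cells from valid neighbours, assuming the -1 sentinel convention from the projection sketch above.

```python
# Illustrative neighbour-averaging fill for missing pixels in a range image.
# This is a generic placeholder, not the filling algorithm proposed in the paper.
import numpy as np

def fill_missing_pixels(range_image, kernel=3, passes=2):
    """Replace cells marked -1 with the mean of valid neighbours in a small window."""
    filled = range_image.copy()
    H, W = filled.shape
    r = kernel // 2
    for _ in range(passes):
        missing = np.argwhere(filled < 0)
        if missing.size == 0:
            break
        updates = {}
        for v, u in missing:
            window = filled[max(0, v - r):v + r + 1, max(0, u - r):u + r + 1]
            valid = window[window >= 0]
            if valid.size > 0:
                updates[(v, u)] = valid.mean()
        # Apply updates after the scan so freshly filled cells do not
        # influence other cells within the same pass.
        for (v, u), value in updates.items():
            filled[v, u] = value
    return filled
```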