RoadFormer: Duplex Transformer for RGB-Normal Semantic Road Scene Parsing

IF: 14; Zone 1, Engineering Technology; JCR Q1, Computer Science, Artificial Intelligence

IEEE Transactions on Intelligent Vehicles, Pub Date: 2024-04-17, DOI: 10.1109/TIV.2024.3388726

Jiahang Li, Yikang Zhang, Peng Yun, Guangliang Zhou, Qijun Chen, Rui Fan

Journal reference: IEEE Transactions on Intelligent Vehicles, vol. 9, no. 7, pp. 5163-5172. Full text: https://ieeexplore.ieee.org/document/10504607/

Citations: 0

Abstract

The recent advancements in deep convolutional neural networks have shown significant promise in the domain of road scene parsing. Nevertheless, the existing works focus primarily on freespace detection, with little attention given to hazardous road defects that could compromise both driving safety and comfort. In this article, we introduce RoadFormer, a novel Transformer-based data-fusion network developed for road scene parsing. RoadFormer utilizes a duplex encoder architecture to extract heterogeneous features from both RGB images and surface normal information. The encoded features are subsequently fed into a novel heterogeneous feature synergy block for effective feature fusion and recalibration. The pixel decoder then learns multi-scale long-range dependencies from the fused and recalibrated heterogeneous features, which are subsequently processed by a Transformer decoder to produce the final semantic prediction. Additionally, we release SYN-UDTIRI, the first large-scale road scene parsing dataset that contains over 10,407 RGB images, dense depth images, and the corresponding pixel-level annotations for both freespace and road defects of different shapes and sizes. Extensive experimental evaluations conducted on our SYN-UDTIRI dataset, as well as on three public datasets, including KITTI road, CityScapes, and ORFD, demonstrate that RoadFormer outperforms all other state-of-the-art networks for road scene parsing. Specifically, RoadFormer ranks first on the KITTI road benchmark.
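The data flow described in the abstract (surface normals alongside RGB, then fusion and recalibration of the two feature streams) can be illustrated with a minimal NumPy sketch. Everything in it is an assumption made for illustration: the abstract does not say how RoadFormer obtains surface normals or how its heterogeneous feature synergy block is built. Here, `depth_to_normals` is a crude image-space approximation that ignores camera intrinsics, and `fuse_and_recalibrate` is a generic squeeze-and-excitation-style channel gate with hypothetical weight matrices `w_reduce` and `w_expand`, not the authors' block.

```python
import numpy as np


def depth_to_normals(depth):
    """Approximate unit surface normals from a dense depth map.

    Assumption: image-space gradients only, no camera intrinsics.
    depth: (H, W) array. Returns a (3, H, W) array of unit vectors.
    """
    # np.gradient returns derivatives along rows (y) first, then columns (x).
    dz_dy, dz_dx = np.gradient(depth.astype(np.float64))
    normals = np.stack([-dz_dx, -dz_dy, np.ones_like(dz_dx)], axis=0)
    return normals / np.linalg.norm(normals, axis=0, keepdims=True)


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def fuse_and_recalibrate(rgb_feat, normal_feat, w_reduce, w_expand):
    """Fuse two (C, H, W) feature maps with a channel-wise gate.

    Hypothetical stand-in for a fusion block: concatenate both streams,
    compute one gate per channel from globally pooled statistics,
    rescale the channels, then sum the two recalibrated streams.
    w_reduce: (R, 2C) and w_expand: (2C, R) are illustrative weights.
    """
    stacked = np.concatenate([rgb_feat, normal_feat], axis=0)  # (2C, H, W)
    pooled = stacked.mean(axis=(1, 2))                         # (2C,)
    hidden = np.maximum(w_reduce @ pooled, 0.0)                # ReLU, (R,)
    gates = sigmoid(w_expand @ hidden)                         # (2C,)
    recalibrated = stacked * gates[:, None, None]
    c = rgb_feat.shape[0]
    return recalibrated[:c] + recalibrated[c:]                 # (C, H, W)


# Usage with random stand-ins for the two encoder outputs.
rng = np.random.default_rng(0)
normals = depth_to_normals(rng.random((8, 8)))   # (3, 8, 8)
rgb_feat = rng.random((3, 8, 8))
fused = fuse_and_recalibrate(
    rgb_feat, normals, rng.normal(size=(2, 6)), rng.normal(size=(6, 2))
)
print(fused.shape)
```

In a real network the two inputs would be multi-scale encoder features rather than raw normals and pixels, and the gate weights would be learned; the sketch only shows the shape of the computation.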

Source journal

IEEE Transactions on Intelligent Vehicles (Mathematics: Control and Optimization)

CiteScore: 12.10

Self-citation rate: 13.40%

Articles published: 177

Journal description: The IEEE Transactions on Intelligent Vehicles (T-IV) is a premier platform for publishing peer-reviewed articles that present innovative research concepts, application results, significant theoretical findings, and application case studies in the field of intelligent vehicles. With a particular emphasis on automated vehicles within roadway environments, T-IV aims to raise awareness of pressing research and application challenges. Our focus is on providing critical information to the intelligent vehicle community, serving as a dissemination vehicle for IEEE ITS Society members and others interested in learning about the state-of-the-art developments and progress in research and applications related to intelligent vehicles. Join us in advancing knowledge and innovation in this dynamic field.