Lightweight and efficient feature fusion real-time semantic segmentation network

IF 4.2; Zone 3 (Computer Science); JCR Q2 (COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE)

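Image and Vision Computing, Volume 154, Article 105408. Pub Date: 2025-02-01. DOI: 10.1016/j.imavis.2024.105408
Full text: https://www.sciencedirect.com/science/article/pii/S0262885624005134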

Jie Zhong, Aiguo Chen, Yizhang Jiang, Chengcheng Sun, Yuheng Peng


Citations: 0

Abstract



The increasing demand for real-time semantic segmentation in autonomous driving has prompted a significant focus on the trade-off between speed and accuracy. Recently, many real-time semantic segmentation networks have opted for lightweight classification networks as their backbone. However, because these backbones are not designed specifically for real-time semantic segmentation, their ability to extract advanced semantic information effectively is compromised. This paper introduces LAFFNet, a lightweight and efficient feature-fusion real-time semantic segmentation network. We devised a novel lightweight feature extraction block (LEB) to construct the encoder part, employing a combination of deep convolution and dilated convolution to extract local and global semantic features with minimal parameters, thereby enhancing feature map characterization. Additionally, we propose a coarse feature extractor block (CFEB) to recover local details lost during encoding and improve connectivity between the encoding and decoding parts. In the decoding phase, we introduce the bilateral feature fusion block (BFFB), leveraging features from different inference stages to enhance the model’s ability to capture multi-scale features and conduct efficient feature fusion operations. Without pre-training, LAFFNet achieves a processing speed of 63.7 FPS on high-resolution (1024 × 2048) images from the Cityscapes dataset, with an mIoU of 77.06%. On the CamVid dataset, the model performs equally well, reaching 107.4 FPS with an mIoU of 68.29%. Notably, the model contains only 0.96 million parameters, demonstrating its exceptional efficiency in lightweight network design. These results demonstrate that LAFFNet achieves an optimal balance between accuracy and speed, providing an effective and precise solution for real-time semantic segmentation tasks.
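The abstract's claim that local and global features can be extracted "with minimal parameters" can be illustrated with simple weight counting. The sketch below is not the LEB itself, whose layout the abstract does not give. It assumes that "deep convolution" refers to depthwise convolution, that the depthwise layer is paired with a 1 × 1 pointwise layer, and that the block is 64 channels wide; all three are illustrative assumptions, as are the function names.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k 2-D convolution without bias.

    Dilation spreads the kernel taps apart but adds no weights,
    so it does not appear in this formula.
    """
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by groups")
    return (c_in // groups) * c_out * k * k


def effective_kernel(k, dilation):
    """Side length of the window spanned by a k x k kernel at a given dilation."""
    return k + (k - 1) * (dilation - 1)


C = 64  # hypothetical channel width, not a value from the paper

standard = conv_params(C, C, k=3)             # ordinary 3x3 convolution
depthwise = conv_params(C, C, k=3, groups=C)  # one 3x3 filter per channel
pointwise = conv_params(C, C, k=1)            # 1x1 convolution to mix channels

print(standard, depthwise + pointwise)        # 36864 4672
for d in (1, 2, 4, 8):
    print(d, effective_kernel(3, d))          # 3, 5, 9, 17
```

Under these assumptions the depthwise plus pointwise pair uses roughly 8 times fewer weights than a standard 3 × 3 convolution, and raising the dilation rate widens the window each output sees at no extra weight cost, which is one plausible reading of how a block could capture wider context cheaply.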


Source journal

Image and Vision Computing (Engineering & Technology; Engineering: Electrical & Electronic)

CiteScore: 8.50

Self-citation rate: 8.50%

Articles published: 143

Review time: 7.8 months

About the journal: Image and Vision Computing has as a primary aim the provision of an effective medium of interchange for the results of high quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or addresses the application of such methods to real world scenes. It seeks to strengthen a deeper understanding in the discipline by encouraging the quantitative comparison and performance evaluation of the proposed methodology. The coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically-inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, image databases.