Jie Zhong, Aiguo Chen, Yizhang Jiang, Chengcheng Sun, Yuheng Peng
Journal: Image and Vision Computing, Volume 154, Article 105408 (JCR Q2, Computer Science, Artificial Intelligence; Impact Factor 4.2)
DOI: 10.1016/j.imavis.2024.105408
Published: 2025-02-01
URL: https://www.sciencedirect.com/science/article/pii/S0262885624005134
Lightweight and efficient feature fusion real-time semantic segmentation network
The increasing demand for real-time semantic segmentation in autonomous driving has brought the trade-off between speed and accuracy into sharp focus. Many recent real-time semantic segmentation networks adopt lightweight classification networks as their backbone; however, because these backbones are not designed for real-time segmentation, they struggle to extract high-level semantic information effectively. This paper introduces LAFFNet, a lightweight and efficient feature-fusion real-time semantic segmentation network. We devise a novel lightweight feature extraction block (LEB) to build the encoder, combining depthwise convolution and dilated convolution to extract local and global semantic features with minimal parameters, thereby enriching the feature-map representation. We further propose a coarse feature extractor block (CFEB) that recovers local details lost during encoding and strengthens the connection between the encoder and decoder. In the decoding phase, a bilateral feature fusion block (BFFB) leverages features from different inference stages, improving the model's ability to capture multi-scale features while keeping the fusion operations efficient. Without pre-training, LAFFNet runs at 63.7 FPS on high-resolution (1024 × 2048) images from the Cityscapes dataset with an mIoU of 77.06%, and performs equally well on the CamVid dataset, reaching 107.4 FPS with an mIoU of 68.29%. Notably, the model contains only 0.96 million parameters, demonstrating its exceptional efficiency in lightweight network design. These results show that LAFFNet strikes a strong balance between accuracy and speed, offering an effective and precise solution for real-time semantic segmentation tasks.
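The abstract's two quantitative claims, that depthwise convolution keeps the parameter count minimal and that dilated convolution captures global context, can be illustrated with a quick back-of-envelope calculation. The channel width and dilation rates below are hypothetical placeholders, not the paper's actual configuration:

```python
# Illustrates why depthwise separable and dilated convolutions are the usual
# ingredients of lightweight segmentation encoders. All numbers are
# hypothetical; the paper's exact LEB configuration is not specified here.

def standard_conv_params(c_in, c_out, k=3):
    """Parameters of a standard k x k convolution (bias omitted)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k=3):
    """Depthwise k x k convolution followed by a 1 x 1 pointwise convolution."""
    return k * k * c_in + c_in * c_out

def receptive_field(kernel=3, dilations=(1, 2, 4)):
    """Receptive field of a stack of stride-1 dilated convolutions."""
    rf = 1
    for d in dilations:
        rf += (kernel - 1) * d
    return rf

if __name__ == "__main__":
    c = 64  # hypothetical channel width
    std = standard_conv_params(c, c)        # 3*3*64*64 = 36864
    dws = depthwise_separable_params(c, c)  # 9*64 + 64*64 = 4672
    print(f"standard 3x3 conv:   {std} params")
    print(f"depthwise separable: {dws} params (~{std / dws:.1f}x fewer)")
    # Three 3x3 layers with dilations 1, 2, 4 cover a 15x15 window,
    # versus 7x7 for three undilated layers.
    print(f"receptive field with dilations (1, 2, 4): {receptive_field()}")
```

At equal channel width the separable variant uses roughly 8x fewer parameters, and stacking dilated layers more than doubles the receptive field at no extra cost, which is consistent with the abstract's motivation for pairing the two in the LEB.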
Journal Introduction:
The primary aim of Image and Vision Computing is to provide an effective medium of interchange for the results of high-quality theoretical and applied research fundamental to all aspects of image interpretation and computer vision. The journal publishes work that proposes new image interpretation and computer vision methodology or applies such methods to real-world scenes, and it encourages the quantitative comparison and performance evaluation of proposed methods to deepen understanding in the discipline. Coverage includes: image interpretation, scene modelling, object recognition and tracking, shape analysis, monitoring and surveillance, active vision and robotic systems, SLAM, biologically inspired computer vision, motion analysis, stereo vision, document image understanding, character and handwritten text recognition, face and gesture recognition, biometrics, vision-based human-computer interaction, human activity and behavior understanding, data fusion from multiple sensor inputs, and image databases.