ZMNet: feature fusion and semantic boundary supervision for real-time semantic segmentation

The Visual Computer, Pub Date: 2024-06-20, DOI: 10.1007/s00371-024-03448-6

Ya Li, Ziming Li, Huiwang Liu, Qing Wang


Citations: 0

Abstract



Feature fusion modules are an essential component of real-time semantic segmentation networks because they bridge the semantic gap among different feature layers. However, many networks fuse multi-level features inefficiently. In this paper, we propose a simple yet effective decoder that consists of a series of multi-level attention feature fusion modules (MLA-FFMs), which fuse multi-level features in a top-down manner. Because MLA-FFM is a lightweight attention-based module, it can fuse features efficiently to bridge the semantic gap between levels while remaining suitable for real-time segmentation. In addition, to address the low accuracy of existing real-time segmentation methods at semantic boundaries, we propose a semantic boundary supervision module (BSM) that improves accuracy by supervising the prediction of semantic boundaries. Extensive experiments demonstrate that our network achieves a state-of-the-art trade-off between segmentation accuracy and inference speed on both the Cityscapes and CamVid datasets. On a single NVIDIA GeForce 1080Ti GPU, our model achieves 77.4% mIoU at 97.5 FPS on the Cityscapes test set and 74% mIoU at 156.6 FPS on the CamVid test set, which is superior to most state-of-the-art real-time methods.
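The abstract refers to two quantities without defining them: the semantic boundaries that the BSM supervises, and the mIoU metric used to report accuracy. The sketch below illustrates both on toy label maps. It is not the authors' implementation. The abstract does not say how boundary targets are derived, so the 4-neighbour rule in `boundary_mask` is an assumption made for illustration, and both function names are hypothetical. `mean_iou` follows the standard definition (per-class intersection over union, averaged over the classes that occur) and assumes every label lies in `[0, num_classes)`, with no ignore label.

```python
from typing import List

LabelMap = List[List[int]]


def boundary_mask(labels: LabelMap) -> LabelMap:
    """Mark a pixel 1 if any 4-connected neighbour has a different class label.

    Assumed rule for illustration only; the paper may derive boundaries differently.
    """
    height, width = len(labels), len(labels[0])
    mask = [[0] * width for _ in range(height)]
    for y in range(height):
        for x in range(width):
            for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ny, nx = y + dy, x + dx
                inside = 0 <= ny < height and 0 <= nx < width
                if inside and labels[ny][nx] != labels[y][x]:
                    mask[y][x] = 1
                    break
    return mask


def mean_iou(pred: LabelMap, target: LabelMap, num_classes: int) -> float:
    """Mean intersection-over-union over classes present in pred or target."""
    intersection = [0] * num_classes
    union = [0] * num_classes
    for pred_row, target_row in zip(pred, target):
        for p, t in zip(pred_row, target_row):
            if p == t:
                # Pixel is in both the predicted and the true region of class p.
                intersection[p] += 1
                union[p] += 1
            else:
                # Pixel is in the predicted region of p and the true region of t.
                union[p] += 1
                union[t] += 1
    ious = [i / u for i, u in zip(intersection, union) if u > 0]
    return sum(ious) / len(ious) if ious else 0.0


# Toy 2x3 label maps with two classes.
target = [[0, 0, 1],
          [0, 1, 1]]
pred = [[0, 0, 1],
        [0, 0, 1]]

edges = boundary_mask(target)            # binary map that could act as a boundary target
score = mean_iou(pred, target, num_classes=2)
```

In this toy case the single mislabelled pixel lies on a class boundary, which is the kind of error that boundary supervision is intended to reduce.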

