FANet: Feature Aggregation Network for Semantic Segmentation

Tanmay Singha, Duc-Son Pham, A. Krishna
{"title":"FANet: Feature Aggregation Network for Semantic Segmentation","authors":"Tanmay Singha, Duc-Son Pham, A. Krishna","doi":"10.1109/DICTA51227.2020.9363370","DOIUrl":null,"url":null,"abstract":"Due to the rapid development in robotics and autonomous industries, optimization and accuracy have become an important factor in the field of computer vision. It becomes a challenging task for the researchers to design an efficient, optimized model with high accuracy in the field of object detection and semantic segmentation. Some existing off-line scene segmentation methods have shown an outstanding result on different datasets at the cost of a large number of parameters and operations, whereas some well-known real-time semantic segmentation techniques have reduced the number of parameters and operations in demand for resource-constrained applications, but model accuracy is compromised. We propose a novel approach for scene segmentation suitable for resource-constrained embedded devices by keeping a right balance between model architecture and model performance. Exploiting the multi-scale feature fusion technique with accurate localization augmentation, we introduce a fast feature aggregation network, a real-time scene segmentation model capable of handling high-resolution input image (1024 × 2048 px). Relying on an efficient embedded vision backbone network, our feature pyramid network outperforms many existing off-line and real-time pixel-wise deep convolution neural networks (CNNs) and produces 89.7% pixel accuracy and 65.9% mean intersection over union (mIoU) on the Cityscapes benchmark validation dataset whilst having only 1.1M parameters and 5.8B FLOPS.","PeriodicalId":348164,"journal":{"name":"2020 Digital Image Computing: Techniques and Applications (DICTA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-11-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 Digital Image Computing: Techniques and Applications (DICTA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DICTA51227.2020.9363370","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

Due to the rapid development of robotics and autonomous industries, optimization and accuracy have become important factors in the field of computer vision. It is a challenging task for researchers to design an efficient, optimized model with high accuracy for object detection and semantic segmentation. Some existing off-line scene segmentation methods have shown outstanding results on different datasets at the cost of a large number of parameters and operations, whereas some well-known real-time semantic segmentation techniques have reduced the number of parameters and operations demanded by resource-constrained applications, but model accuracy is compromised. We propose a novel approach to scene segmentation suitable for resource-constrained embedded devices by keeping the right balance between model architecture and model performance. Exploiting a multi-scale feature fusion technique with accurate localization augmentation, we introduce a fast feature aggregation network, a real-time scene segmentation model capable of handling high-resolution input images (1024 × 2048 px). Relying on an efficient embedded vision backbone network, our feature pyramid network outperforms many existing off-line and real-time pixel-wise deep convolutional neural networks (CNNs) and produces 89.7% pixel accuracy and 65.9% mean intersection over union (mIoU) on the Cityscapes benchmark validation dataset whilst having only 1.1M parameters and 5.8B FLOPs.
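The abstract describes a feature-pyramid style, top-down aggregation of multi-scale backbone features followed by per-pixel classification. The PyTorch sketch below is a rough illustration of a generic FPN-style fusion head of this kind, not the authors' released FANet code; the class name FeatureAggregationHead, the stage channel counts, and the feature strides are hypothetical placeholders chosen only to make the example self-contained.

# A minimal sketch (assumed structure, not the authors' implementation) of
# FPN-style multi-scale feature aggregation for semantic segmentation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class FeatureAggregationHead(nn.Module):
    """Fuses coarse-to-fine backbone features and predicts per-pixel class scores."""

    def __init__(self, in_channels=(24, 48, 96), mid_channels=64, num_classes=19):
        super().__init__()
        # 1x1 convolutions project each backbone stage to a common channel width.
        self.lateral = nn.ModuleList(
            [nn.Conv2d(c, mid_channels, kernel_size=1) for c in in_channels]
        )
        # 3x3 convolution smooths the aggregated feature map.
        self.fuse = nn.Conv2d(mid_channels, mid_channels, kernel_size=3, padding=1)
        self.classifier = nn.Conv2d(mid_channels, num_classes, kernel_size=1)

    def forward(self, features):
        # features: backbone outputs ordered finest resolution first.
        laterals = [conv(f) for conv, f in zip(self.lateral, features)]
        # Top-down pathway: upsample the coarser map and add it to the next finer one.
        fused = laterals[-1]
        for finer in reversed(laterals[:-1]):
            fused = finer + F.interpolate(
                fused, size=finer.shape[-2:], mode="bilinear", align_corners=False
            )
        fused = self.fuse(fused)
        # Logits stay at the finest feature stride; a final bilinear upsample
        # would restore the full input resolution.
        return self.classifier(fused)


if __name__ == "__main__":
    # Dummy multi-scale features for a 1024 x 2048 input at strides 8, 16 and 32
    # (channel counts are illustrative, not those of the FANet backbone).
    feats = [
        torch.randn(1, 24, 128, 256),
        torch.randn(1, 48, 64, 128),
        torch.randn(1, 96, 32, 64),
    ]
    logits = FeatureAggregationHead()(feats)
    print(logits.shape)  # torch.Size([1, 19, 128, 256])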