Nan Zhou;Mingming Xu;Biaoqun Shen;Ke Hou;Shanwei Liu;Hui Sheng;Yanfen Liu;Jianhua Wan
{"title":"ViT-UNet: A Vision Transformer Based UNet Model for Coastal Wetland Classification Based on High Spatial Resolution Imagery","authors":"Nan Zhou;Mingming Xu;Biaoqun Shen;Ke Hou;Shanwei Liu;Hui Sheng;Yanfen Liu;Jianhua Wan","doi":"10.1109/JSTARS.2024.3487250","DOIUrl":null,"url":null,"abstract":"High resolution remote sensing imagery plays a crucial role in monitoring coastal wetlands. Coastal wetland landscapes exhibit diverse features, ranging from fragmented patches to expansive areas. Mainstream convolutional neural networks cannot effectively analyze spatial relationships among consecutive image elements. This limitation impedes their performance in accurately classifying coastal wetlands. In order to tackle the above issues, we propose a Vision Transformer based UNet (ViT-UNet) model. This model extracts wetland features from high resolution remote sensing images by sensing and optimizing multiscale features. To establish global dependencies, the Vision Transformer (ViT) is introduced to replace the convolutional layer in the UNet encoder. Simultaneously, the model incorporates a convolutional block attention module and a multiple hierarchies attention module to restore attentional features and reduce feature loss. In addition, a skip connection is added to the single-skip structure of the original UNet model. This connection simultaneously links the output of the entire transformer and internal attention features to the corresponding decoder level. This enhancement aims to furnish the decoder with comprehensive global information guidance. Finally, all the extracted feature information is fused using Bilinear Polymerization Pooling (BPP). The BPP assists the network in obtaining a more comprehensive and detailed feature representation. Experimental results on the Gaofen-1 dataset demonstrate that the proposed ViT-UNet method achieves a Precision score of 93.50\n<inline-formula><tex-math>$\\%$</tex-math></inline-formula>\n, outperforming the original UNet model by 4.10\n<inline-formula><tex-math>$\\%$</tex-math></inline-formula>\n. Compared with other state-of-the-art networks, ViT-UNet performs more accurately and finer in the extraction of wetland information in the Yellow River Delta.","PeriodicalId":13116,"journal":{"name":"IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing","volume":"17 ","pages":"19575-19587"},"PeriodicalIF":4.7000,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10737119","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing","FirstCategoryId":"5","ListUrlMain":"https://ieeexplore.ieee.org/document/10737119/","RegionNum":2,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ENGINEERING, ELECTRICAL & ELECTRONIC","Score":null,"Total":0}
引用次数: 0
Abstract
High resolution remote sensing imagery plays a crucial role in monitoring coastal wetlands. Coastal wetland landscapes exhibit diverse features, ranging from fragmented patches to expansive areas. Mainstream convolutional neural networks cannot effectively analyze spatial relationships among consecutive image elements. This limitation impedes their performance in accurately classifying coastal wetlands. In order to tackle the above issues, we propose a Vision Transformer based UNet (ViT-UNet) model. This model extracts wetland features from high resolution remote sensing images by sensing and optimizing multiscale features. To establish global dependencies, the Vision Transformer (ViT) is introduced to replace the convolutional layer in the UNet encoder. Simultaneously, the model incorporates a convolutional block attention module and a multiple hierarchies attention module to restore attentional features and reduce feature loss. In addition, a skip connection is added to the single-skip structure of the original UNet model. This connection simultaneously links the output of the entire transformer and internal attention features to the corresponding decoder level. This enhancement aims to furnish the decoder with comprehensive global information guidance. Finally, all the extracted feature information is fused using Bilinear Polymerization Pooling (BPP). The BPP assists the network in obtaining a more comprehensive and detailed feature representation. Experimental results on the Gaofen-1 dataset demonstrate that the proposed ViT-UNet method achieves a Precision score of 93.50
$\%$
, outperforming the original UNet model by 4.10
$\%$
. Compared with other state-of-the-art networks, ViT-UNet performs more accurately and finer in the extraction of wetland information in the Yellow River Delta.
期刊介绍:
The IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing addresses the growing field of applications in Earth observations and remote sensing, and also provides a venue for the rapidly expanding special issues that are being sponsored by the IEEE Geosciences and Remote Sensing Society. The journal draws upon the experience of the highly successful “IEEE Transactions on Geoscience and Remote Sensing” and provide a complementary medium for the wide range of topics in applied earth observations. The ‘Applications’ areas encompasses the societal benefit areas of the Global Earth Observations Systems of Systems (GEOSS) program. Through deliberations over two years, ministers from 50 countries agreed to identify nine areas where Earth observation could positively impact the quality of life and health of their respective countries. Some of these are areas not traditionally addressed in the IEEE context. These include biodiversity, health and climate. Yet it is the skill sets of IEEE members, in areas such as observations, communications, computers, signal processing, standards and ocean engineering, that form the technical underpinnings of GEOSS. Thus, the Journal attracts a broad range of interests that serves both present members in new ways and expands the IEEE visibility into new areas.