{"title":"STFMamba: Spatiotemporal satellite image fusion network based on visual state space model","authors":"Min Zhao , Xiaolu Jiang , Bo Huang","doi":"10.1016/j.isprsjprs.2025.07.011","DOIUrl":"10.1016/j.isprsjprs.2025.07.011","url":null,"abstract":"<div><div>Remote sensing images provide extensive information about Earth’s surface, supporting a wide range of applications. Individual sensors often encounter a trade-off between spatial and temporal resolutions, spatiotemporal fusion (STF) aims to overcome this shortcoming by combining multisource data. Existing deep learning-based STF methods struggle with capturing long-range dependencies (CNN-based) or incur high computational cost (Transformer-based). To overcome these limitations, we propose STFMamba, a two-step state space model that effectively captures global information while maintaining linear complexity. Specifically, a super-resolution (SR) network is firstly utilized to mitigate sensor heterogeneity of multisource data, then a dual U-Net is designed to fully leverage spatio-temporal correlations and capture temporal variations. Our STFMamba contains the following three key components: 1) the multidimensional scanning mechanism for global relationship modeling to eliminate information loss, 2) a spatio-spectral–temporal fusion scanning strategy to integrate multiscale contextual features, and 3) a multi-head cross-attention module for adaptive selection and fusion. Additionally, we develop a lightweight version of STFMamba for deployment on resource-constrained devices, incorporating a knowledge distillation strategy to align its features with the base model and enhance performance. Extensive experiments on three benchmark datasets demonstrate the superiority of the proposed method. Specifically, our method outperforms compared methods, including FSDAF, FVSDF, EDCSTFN, GANSTFM, SwinSTFM, and DDPMSTF, with average RMSE reductions of 24.25%, 25.94%, 18.15%, 14.36%, 9.63%, and 12.82%, respectively. Our code is available at: <span><span>https://github.com/zhaomin0101/STFMamba</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"228 ","pages":"Pages 288-304"},"PeriodicalIF":10.6,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144680104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mind the modality gap: Towards a remote sensing vision-language model via cross-modal alignment
Angelos Zavras, Dimitrios Michail, Begüm Demir, Ioannis Papoutsis
ISPRS Journal of Photogrammetry and Remote Sensing, Volume 228, pp. 270-287. Published 2025-07-21. DOI: 10.1016/j.isprsjprs.2025.06.019

Abstract: Deep Learning (DL) is undergoing a paradigm shift with the emergence of foundation models. In this work, we focus on Contrastive Language-Image Pre-training (CLIP), a vision-language foundation model that achieves high accuracy across various image classification tasks and often rivals fully supervised baselines despite not being explicitly trained for those tasks. Nevertheless, there are still domains where zero-shot CLIP performance is far from optimal, such as Remote Sensing (RS) and medical imagery. These domains not only exhibit fundamentally different distributions from natural images, but also commonly rely on complementary modalities beyond RGB to derive meaningful insights. To this end, we propose a methodology to align distinct RS image modalities with the visual and textual modalities of CLIP. Our two-stage procedure addresses the aforementioned distribution shift, extends the zero-shot capabilities of CLIP, and enriches CLIP's shared embedding space with domain-specific knowledge. Initially, we robustly fine-tune CLIP according to the PAINT patching protocol (Ilharco et al., 2022) to deal with the distribution shift. Building upon this foundation, we facilitate the cross-modal alignment of an RS modality encoder by distilling knowledge from the CLIP visual and textual encoders. We empirically show that both patching and cross-modal alignment translate to significant performance gains across several RS imagery classification and cross-modal retrieval benchmark datasets. Patching dramatically improves RS imagery (RGB) classification (BigEarthNet-5: +39.76% mAP, BigEarthNet-19: +56.86% mAP, BigEarthNet-43: +28.43% mAP, SEN12MS: +20.61% mAP, EuroSAT: +5.98% Acc), maintains performance on the representative supported task (ImageNet), and, most critically, outperforms existing RS-specialized CLIP variants such as RemoteCLIP (Liu et al., 2023a) and SkyCLIP (Wang et al., 2024). Cross-modal alignment extends zero-shot capabilities to multi-spectral data, surpassing our patched CLIP classification performance and establishing strong cross-modal retrieval baselines. Linear probing further confirms the quality of the learned representations of our aligned multi-spectral encoder, which outperforms existing RS foundation models such as SatMAE (Cong et al., 2022). Notably, these enhancements are achieved without reliance on textual descriptions, without introducing any task-specific parameters, without training from scratch, and without catastrophic forgetting. Our work highlights the potential of leveraging existing VLMs' large-scale pre-training and extending their zero-shot capabilities to specialized fields, paving the way for the resource-efficient establishment of in-domain multi-modal foundation models in RS and beyond. We make our code implementation and weights for all experiments publicly available on our project's GitHub repository: https://github.com/Orion-AI-Lab/MindTheModalityGap.
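A minimal sketch of weight-space patching in the spirit of PAINT (Ilharco et al., 2022): interpolating between zero-shot and fine-tuned CLIP weights. The mixing coefficient and the simple state-dict handling are illustrative assumptions, not the paper's exact recipe.

```python
# Hedged sketch of PAINT-style patching: theta_patched is a convex combination
# of zero-shot and fine-tuned weights; alpha is chosen on held-out data.
import torch

def patch_weights(zeroshot_sd: dict, finetuned_sd: dict, alpha: float = 0.5) -> dict:
    """Return (1 - alpha) * theta_zeroshot + alpha * theta_finetuned per parameter."""
    return {k: (1 - alpha) * zeroshot_sd[k] + alpha * finetuned_sd[k]
            for k in zeroshot_sd}

# usage sketch with toy tensors standing in for CLIP state dicts
zs = {"visual.proj": torch.randn(512, 256)}
ft = {"visual.proj": zs["visual.proj"] + 0.1 * torch.randn(512, 256)}
patched = patch_weights(zs, ft, alpha=0.7)
```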
Three-dimensional reconstruction of shallow seabed topographic surface based on fusion of side-scan sonar and echo sounding data
Chunqing Ran, Luotao Zhang, Shuo Han, Xiaobo Zhang, Shengli Wang, Xinghua Zhou
ISPRS Journal of Photogrammetry and Remote Sensing, Volume 228, pp. 249-269. Published 2025-07-21. DOI: 10.1016/j.isprsjprs.2025.07.018

Abstract: High-precision topographic mapping of offshore shallow seabeds is of great significance in a number of fields, including shipping navigation, disaster warning, environmental monitoring, and resource management. However, conventional side-scan sonar (SSS) techniques struggle to obtain seabed elevation data, which limits their application in three-dimensional (3D) topographic reconstruction. Meanwhile, although a single-beam echo sounder (SBES) can provide accurate depth information, its sparse spatial coverage makes it difficult to capture the details of complex terrain. To overcome the limitations of any single technique in 3D seafloor topographic reconstruction, this study fuses SSS and SBES data and proposes the Multi-Scale Gradient Fusion Shape From Shading (MSGF-SFS) algorithm. The algorithm extracts and fuses surface gradient information by analyzing the intensity variations in SSS images at multiple scales, enabling the construction of a 3D discrete elevation model from two-dimensional (2D) SSS data. To reduce the inherent elevation error of SSS, a multi-source data alignment and correction algorithm is introduced that combines terrain feature extraction with least squares optimization to fuse the SBES depth data with the 3D discrete elevation model for calibration. The quality of the 3D discrete elevation model is then optimized by data filtering based on quadtree domain partitioning and a least squares function. Finally, a high-resolution, continuous 3D seabed model is constructed from the filtered data using an implicit-function deep learning algorithm based on the Undirected Distance Function (IF-UDF). Using these methods, this study reconstructs the 3D seabed topography of an offshore area in the Yellow Sea of China and conducts comparative experiments. The findings demonstrate that the proposed pipeline can effectively reconstruct a fine 3D seabed model; the resulting model outperforms existing 3D reconstruction techniques in normal consistency and continuity and shows stronger robustness and higher accuracy than traditional algorithms. The method provides a systematic and practical solution for high-resolution offshore topographic mapping, especially where high precision is required in complex environments, and can effectively serve as an alternative to multibeam systems for offshore topography mapping.
Cross-scenario damaged building extraction network: Methodology, application, and efficiency using single-temporal HRRS imagery
Haifeng Wang, Wei He, Zhuohong Li, Naoto Yokoya
ISPRS Journal of Photogrammetry and Remote Sensing, Volume 228, pp. 228-248. Published 2025-07-19. DOI: 10.1016/j.isprsjprs.2025.06.028

Abstract: The extraction of damaged buildings is of significant importance in various fields, such as disaster assessment and resource allocation. Although multi-temporal methods exhibit remarkable advantages in detecting damaged buildings, single-temporal extraction remains crucial in real-world emergency responses due to its immediate usability. However, single-temporal cross-scenario extraction from high-resolution remote sensing (HRRS) imagery faces the following challenges: (i) morphological heterogeneity of building damage, caused by the interplay of unknown disaster types with unpredictable geographic contexts, and (ii) scarcity of fine-grained annotated datasets for unseen disaster scenarios, which limits the accuracy of rapid damage mapping. Confronted with these challenges, our main idea is to decompose the complex features of damaged buildings into five attribute-features, which can be trained using historical disaster data to enable the independent learning of both building styles and damage features. Consequently, we propose a novel Correlation Feature Decomposition Network (CFDNet) along with a coarse-to-fine training strategy for cross-scenario damaged building extraction. In detail, at the coarse training stage, CFDNet is trained to decompose the damaged building segmentation task into the extraction of multiple attribute-features. At the fine training stage, specific attribute-features, such as the building feature and the damage feature, are trained using auxiliary datasets. We have evaluated CFDNet on several datasets covering different types of disasters and demonstrated its superiority and robustness compared with state-of-the-art methods. Finally, we apply the proposed model to damaged building extraction in areas historically affected by major disasters, namely the Turkey-Syria earthquakes on 6 February 2023, Cyclone Mocha in the Bay of Bengal on 23 May 2023, and Hurricane Ian in Florida, USA, in September 2022. Results from these practical applications also emphasize the significant advantages of the proposed CFDNet.
CSW-SAM: a cross-scale algorithm for very-high-resolution water body segmentation based on segment anything model 2
Tianyi Zhang, Yi Ren, Weibin Li, Chenhao Qin, Licheng Jiao, Hua Su
ISPRS Journal of Photogrammetry and Remote Sensing, Volume 228, pp. 208-227. Published 2025-07-19. DOI: 10.1016/j.isprsjprs.2025.07.008

Abstract: Large-scale high-resolution water body (WB) extraction is one of the research hotspots in remote sensing image processing. However, accurate training labels for diverse WBs at very high resolution (VHR) are extremely scarce. Given that low-resolution (LR) images and labels are more easily accessible, the challenge lies in fully leveraging LR data to guide high-precision WB extraction from VHR images. To address this issue, we propose a novel cross-scale algorithm, CSW-SAM, based on SAM2, which learns the spectral information of WBs from easily accessible 10 m resolution LR images and maps it to 0.3 m resolution VHR remote sensing images for high-precision WB segmentation. In addition to fine-tuning the decoder, we enhance the encoder's ability to learn the mapping relationship between images of different resolutions through adapter tuning. We also design an Automated Clustering Layer (ACL), based on feature similarity and local structure information clustering, to enhance the performance of SAM-based methods in cross-scale WB segmentation. To validate the robustness and generalization ability of CSW-SAM, we conducted extensive experiments on both a self-constructed cross-scale WB dataset and the publicly available GLH-Water dataset. The results confirm that CSW-SAM achieves strong performance across datasets with diverse WB conditions, demonstrating its potential for scalable and low-cost VHR WB mapping. Additionally, the model can be generalized at minimal cost, making it highly promising for large-scale global VHR WB mapping.
FTG-Net: A facade topology-aware graph network for class imbalance structural segmentation of building facades
Yufu Zang, Liu Xu, Zhen Cui, Xiongwu Xiao, Haiyan Guan, Bisheng Yang
ISPRS Journal of Photogrammetry and Remote Sensing, Volume 228, pp. 179-207. Published 2025-07-17. DOI: 10.1016/j.isprsjprs.2025.07.014

Abstract: Digital twin cities and realistic 3D scenes have triggered an ever-increasing demand for high-precision building models. As an important component of urban models, façade segmentation based on point clouds has gained significant attention. However, most existing networks suffer from the class imbalance of façade elements and from inherent limitations of point clouds (e.g., various occlusions, significant noise or outliers, and varying point densities). To address these issues, we propose FTG-Net (Façade Topology-aware Graph Network), which combines façade topology and hierarchical geometric features for robust segmentation. Our framework comprises three key modules: (1) a Façade Topology Extraction (FTE) module that encodes object-level spatial relationships via a 2D manifold grid and topology-aware graph convolutions; (2) a Sampling-enhanced Geometry Extraction (SGE) module that leverages adaptive reweighted sampling and strip pooling to enhance rare-class feature learning; and (3) a Dual-feature Attentive Fusion (DAF) module that adaptively fuses topological and geometric features. To validate the performance of FTG-Net, we annotated two building façade datasets (the NUIST Façade dataset and the Commercial Street dataset) and selected two benchmark datasets (ArCH and ZAHA) for evaluation. Extensive experiments on the annotated datasets demonstrate state-of-the-art performance, achieving 98.33% overall accuracy (OA) and 96.08% mIoU. Evaluations on the benchmark datasets show mIoU improvements of 1.05-6.7% over existing methods, concentrated in the rare-class categories. Ablation studies confirm the critical role of the topology-aware design in capturing spatial regularities (e.g., repetitively arranged balconies) and of the adaptive sampling strategy in mitigating class imbalance. These results demonstrate the effectiveness of FTG-Net across diverse architectural styles and its applicability to digital twin city modeling. Code and datasets are publicly available at https://github.com/zangyufus/FTG-Net.
RO2-DETR: Rotation-equivariant oriented object detection transformer with 1D rotated convolution kernel
Min Dang, Gang Liu, Adams Wai-Kin Kong, Zhaolu Zheng, Nan Luo, Rong Pan
ISPRS Journal of Photogrammetry and Remote Sensing, Volume 228, pp. 166-178. Published 2025-07-16. DOI: 10.1016/j.isprsjprs.2025.06.029

Abstract: Many oriented object detectors have achieved significant performance by adopting convolutional neural networks (CNNs). However, traditional CNNs cannot explicitly model orientation information, which limits them when dealing with oriented objects. The DEtection TRansformer (DETR), a Transformer-based end-to-end object detection framework, has made great progress and demonstrated strong capabilities; compared with CNNs, the Transformer decoder is better able to perceive the orientation of oriented objects. Nevertheless, DETR still has difficulty accurately detecting objects with arbitrary orientations in aerial images due to the misalignment between axis-aligned features and oriented objects. In this paper, we propose a rotation-equivariant oriented object detection transformer, RO2-DETR, for oriented object detection in aerial images. Specifically, a rotation-equivariant module models the orientation of objects, thereby enhancing the oriented feature representation of the backbone network. An oriented deformable decoder is then designed by embedding orientation information into the encoder to select sampling points using an oriented bounding box (OBB) and a two-dimensional (2D) Gaussian distribution. In addition, to optimize the matching strategy, a one-to-many (o2m) matching scheme dynamically adjusts the number of queries during training to enhance representation learning. Comprehensive experiments on three challenging datasets, DOTA, HRSC2016, and DIOR-R, yielded mAP50 scores of 77.82%, 97.47%, and 66.43%, respectively. These results demonstrate that RO2-DETR achieves competitive performance compared to state-of-the-art oriented object detection methods. Our code is available at https://github.com/DangMinmin/RO2-DETR.
"Phenology description is all you need!" Mapping unknown crop types with remote sensing time-series and LLM generated text alignment
Siyuan Wen, Wenzhi Zhao, Fengcheng Ji, Rui Peng, Liqiang Zhang, Qiao Wang
ISPRS Journal of Photogrammetry and Remote Sensing, Volume 228, pp. 141-165. Published 2025-07-16. DOI: 10.1016/j.isprsjprs.2025.07.002

Abstract: Accurate crop monitoring is crucial for global food security and sustainable agricultural management. Almost all previously published methods are trained in closed environments, limiting their ability to generalize beyond the training areas or to unseen crop categories and thus severely constraining the scalability and adaptability of crop monitoring. Zero-shot learning (ZSL)-based classification methods have made significant progress recently and provide an effective solution to these challenges by establishing connections between seen and unseen categories through semantic knowledge. However, the semantic knowledge extracted by existing methods often lacks the domain-specific detail needed to distinguish different crop types. To this end, we propose a novel contrastive learning framework that explores the application of zero-shot learning to crop classification for the first time. Specifically, our method extracts visual features from time-series patches and the corresponding curves, then uses a large language model (LLM) to automatically generate high-quality time-series text descriptions. These descriptions provide unique phenological information and growth patterns for each crop type. Additionally, we process keywords related to phenological information and growth patterns through a graph convolutional network (GCN) to effectively capture interrelated phenological stages and spatial dependencies. Experimental results in three different study areas demonstrate that our approach outperforms traditional supervised crop classification methods as well as ZSL baselines. Our findings highlight the effectiveness and interpretability of leveraging time-series data to exploit the visual information and semantic knowledge of crops for zero-shot crop classification across diverse regions and crop types. Code and a pretrained model are available at https://github.com/Shawie66/Phenology-Description-Is-All-You-Need.
{"title":"Deep learning-based road extraction from remote sensing imagery: Progress, problems, and perspectives","authors":"Xiaoyan Lu , Qihao Weng","doi":"10.1016/j.isprsjprs.2025.07.013","DOIUrl":"10.1016/j.isprsjprs.2025.07.013","url":null,"abstract":"<div><div>Accurate and up-to-date mapping and extraction of road networks are essential for maintaining urban functionality and fostering socioeconomic development, particularly in realizing intelligent transport systems and smart city management. Recent advancements in Earth observation and artificial intelligence technologies have facilitated more efficient and accurate extraction of road networks from large volumes of remote sensing imagery. To investigate these developments, we conducted a comprehensive review of peer-reviewed literature published between 2017 and 2024, by examining three aspects: data, methods, and applications. This review revealed key trends in deep learning-based road extraction from remote sensing imagery, including a shift from raster to vector approaches, from local-scale to global-scale studies, and from pixel-level recognition to practical applications. Additionally, to achieve high-precision, global-scale road vector extraction, we highlight three emerging research directions: 1) vectorized extraction of complex viaducts; 2) integration of multimodal remote sensing data; and 3) the development of novel applications to foster scientific discoveries. Advancing research in these areas will have profound implications for traffic management, urban planning, disaster response, and the analysis of socio-economic dynamics. Furthermore, this review collects and shares open-source datasets and code related to road extraction to support future research, available at <span><span>https://github.com/RCAIG/GRE-Hub</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":50269,"journal":{"name":"ISPRS Journal of Photogrammetry and Remote Sensing","volume":"228 ","pages":"Pages 122-140"},"PeriodicalIF":10.6,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144633247","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"地球科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Guidance disentanglement network for optics-guided thermal UAV image super-resolution
Zhicheng Zhao, Juanjuan Gu, Chenglong Li, Chun Wang, Zhongling Huang, Jin Tang
ISPRS Journal of Photogrammetry and Remote Sensing, Volume 228, pp. 64-82. Published 2025-07-15. DOI: 10.1016/j.isprsjprs.2025.06.011

Abstract: Optics-guided Thermal UAV image Super-Resolution (OTUAV-SR) has attracted significant research interest due to its potential applications in security inspection, agricultural measurement, and object detection. Existing methods often employ a single guidance model to generate guidance features from optical images to assist thermal UAV image super-resolution. However, a single guidance model struggles to generate effective guidance features under both favorable and adverse conditions in UAV scenarios, which limits OTUAV-SR performance. To address this issue, we propose a novel Guidance Disentanglement Network (GDNet), which disentangles the optical image representation according to typical UAV scenario attributes to form guidance features under both favorable and adverse conditions, enabling robust OTUAV-SR. Moreover, we design an attribute-aware fusion module to combine all attribute-based optical guidance features, forming a more discriminative representation suited to the attribute-agnostic guidance process. To facilitate OTUAV-SR research in complex UAV scenarios, we introduce VGTSR2.0, a large-scale benchmark dataset containing 3,500 aligned optical-thermal image pairs captured under diverse conditions and scenes. Extensive experiments on VGTSR2.0 demonstrate that GDNet significantly improves OTUAV-SR performance over state-of-the-art methods, especially in the challenging low-light and foggy environments commonly encountered in UAV scenarios. The dataset and code will be publicly available at https://github.com/Jocelyney/GDNet.