{"title":"Two-Stage Feature Selection for Fine-Grained Image Recognition Via Partial Order Analysis and Heterogeneity Evaluation","authors":"Hongli Gao, Sulan Zhang, Huiyuan Zhou, Lihua Hu, Jifu Zhang","doi":"10.1049/ipr2.70088","DOIUrl":"https://doi.org/10.1049/ipr2.70088","url":null,"abstract":"<p>The core challenge of fine-grained image recognition (FGIR) tasks is distinguishing highly similar subclasses within the same base category. Most CNN-based deep learning methods typically focus on extracting information from local regions while overlook the inherent structure between subclasses and the complex relationships between features. This paper presents a two-stage feature selection method based on partial order analysis (POA) and heterogeneity evaluation (HE) for FGIR tasks, guiding the model to focus on distinctive features while reducing uncertainty caused by interfering information. Specifically, in the POA stage, clustering first groups similar subcategories into a medium-granularity category. Formal concept analysis then models their hierarchical partial order, identifying “shared features” among subcategories and “exclusive features” unique to each. This structured representation highlights key contrastive cues. In the HE stage, a novel heterogeneity index is introduced to measure the fluctuation of low-level features within each fine-grained category. This index guides the model to suppress pseudo-discriminative features with high heterogeneity, mitigating the impact of noisy and unstable information on decision-making. We perform comprehensive experiments on three commonly used benchmark datasets (CUB-200-2011, Stanford Cars, and FGVC-Aircraft). Experimental results show that the proposed method outperforms classic FGIC methods, validating the effectiveness of our approach.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70088","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143879730","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Unsupervised Retinex Exposure Control: A Novel Approach to Image Enhancement","authors":"Yukun Yang, Libo Sun, Weipeng Shi, Wenhu Qin","doi":"10.1049/ipr2.70077","DOIUrl":"https://doi.org/10.1049/ipr2.70077","url":null,"abstract":"<p>In domains such as autonomous driving and remote sensing, images often suffer from challenging lighting conditions, including low-light, backlighting and overexposure, which hinder the recognition of pedestrians, vehicles and traffic signs. While numerous methods have been proposed to address poor image exposure, they often struggle with images containing both low-light and overexposed regions. This paper presents an unsupervised learning-based exposure control method, providing a novel approach to improving image quality under diverse lighting conditions. Leveraging the inherent properties of Retinex theory, we introduce a novel yet simple formula that adjusts image exposure to produce visually pleasing results without requiring paired training data. Experiments on diverse image datasets validate the effectiveness of our approach in addressing various exposure challenges while preserving critical visual details. Our framework not only simplifies the exposure control process but also achieves state-of-the-art performance, highlighting its potential for real-world applications in computer vision and image processing.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-04-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70077","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143879731","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Causal Attention Transformer for Video Text Retrieval","authors":"Hua Lan, Chaohui Lv","doi":"10.1049/ipr2.70093","DOIUrl":"https://doi.org/10.1049/ipr2.70093","url":null,"abstract":"<p>In the metaverse, video text retrieval is an urgent and challenging need for users in social entertainment. The current attention-based video text retrieval models have not fully explored the interaction between video and text, and only brute force feature embedding. Moreover, Due to the unsupervised nature of attention weight training, existing models have weak generalization performance for dataset bias. Essentially, the model learns that false relevant information in the data is caused by confounding factors. Therefore, this article proposes a video text retrieval method based on causal attention transformer. Assuming that the confounding factors affecting the performance of video text retrieval all come from the dataset, a structural causal model that conforms to the video text retrieval task is constructed, and the impact of confounding effects during data training is reduced by adjusting the front door. In addition, we use causal attention transformer to construct a causal inference network to extract causal features between video text pairs, and replace the similarity statistical probability with causal probability in the video text retrieval framework. Experiments are conducted on the MSR-VTT, MSVD, and LSMDC datasets, which proves the effectiveness of the retrieval model proposed in this paper.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-04-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70093","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143875584","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SMG-MATSM: Scene Memory Generation Based on Motion-Aware Temporal Style Modulation","authors":"Liang Wang, Zhao Wang, Shaokang Zhang, Meng Wang, Haibo Liu","doi":"10.1049/ipr2.70083","DOIUrl":"https://doi.org/10.1049/ipr2.70083","url":null,"abstract":"<p>Scene memory generation (SMG) refers to training AI agents to recall scene memories similarly to the human brain. This is the key work to realize the artificial memory system. The challenge is to generate scenes rich in motion and keep it realistic while ensuring temporal consistency. Inspired by the principles of memory function in brain neuroscience, this paper proposes a motion-aware scene generation model named SMG based on motion-aware temporal style modulation (SMG-MATSM), which ensures temporal consistency by redesigning the temporal latent representation and constructing a motion matrix to guide the motion of intermediate latent variables. The motion matrix preserves motion consistency in the scene memory through both the cosine similarity and the Mahalanobis distance of intermediate latent variables of adjacent frames. Additionally, SMG-MATSM uses a style-based approach and enhances conditional features through the motion matrix during the scene memory synthesis process. Experimental results show that SMG-MATSM has better effect of action-enriched scene memory generation, and has varying degrees of efficiency improvement on different datasets with Frechet video distance and Frechet inception distance evaluation metrics.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70083","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143871821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A High-Accuracy YOLOv8-ResAttNet Framework for Maritime Vessel Detection Using Residual Attention","authors":"Peixue Liu, Mingze Sun, Xinyue Han, Shu Liu, Yujie Chen, Han Zhang","doi":"10.1049/ipr2.70085","DOIUrl":"https://doi.org/10.1049/ipr2.70085","url":null,"abstract":"<p>Against the backdrop of constantly upgrading maritime security requirements and dynamic marine environments, satellite based ship detection has become a key technology for national maritime surveillance, resource management, and environmental protection. However, existing methods often struggle to address ongoing challenges, including insufficient sensitivity to small vessels and susceptibility to errors or missed detections in complex ocean backgrounds caused by wave reflections, cloud cover, and lighting changes. To address these limitations, this study proposes YOLOv8 ResAttNet, an enhanced model that integrates residual learning and attention mechanisms into the YOLOv8 framework. The core innovation lies in a custom designed backbone network that combines multi-scale feature aggregation with an improved ICBAM attention module to achieve precise localization of ship targets while suppressing irrelevant background noise. This architecture dynamically recalibrates feature channel weights through residual attention blocks, enhancing the model's ability to distinguish subtle ship features (such as hull contours and superstructures) in different maritime scenarios. Extensive experiments on high-resolution HRSID datasets have demonstrated the superiority of this model: the average accuracy (mAP50) of YOLOv8 ResAttNet is 95.2%, which is 4.9% higher than the original YOLOv8 and over 4% higher than state-of-the-art models such as YOLO SENet and YOLO11. These improvements highlight its robustness in handling scale changes and complex background interference. The research results emphasize the effectiveness of combining residual connectivity with attention driven feature refinement for maritime target detection, especially in small target scenes. This work not only advances the technological frontier of remote sensing image analysis, but also provides a scalable framework for real-world applications such as illegal fishing monitoring, maritime traffic management, and disaster response. Future research directions include extending the model to multimodal satellite data fusion, optimizing the computational efficiency of edge device deployment, and further bridging the gap between theoretical innovation and maritime surveillance systems.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70085","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143861798","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dual Attention Transformers: Adaptive Linear and Hybrid Cross Attention for Remote Sensing Scene Classification","authors":"Yake Zhang, Yufan Zhao, Jianlong Wang, Zhengwei Xu, Dong Liu","doi":"10.1049/ipr2.70076","DOIUrl":"https://doi.org/10.1049/ipr2.70076","url":null,"abstract":"<p>Vision Transformers (ViTs) have demonstrated strong capabilities in capturing global contextual information compared to convolutional neural networks, making them promising for remote sensing image analysis. However, ViTs often overlook critical local features, limiting their ability to accurately interpret intricate scenes. To address this issue, we propose an adaptive linear hybrid cross attention transformer (ALHCT). It integrates adaptive linear (AL) attention and hybrid cross (HC) attention to simultaneously learn local and global features. AL is introduced into ViT, as it helps reduce computational complexity from exponential to linear scale. Furthermore, ALHCT incorporates two adaptive linear swin transformers (ALST) to achieve multi-scale feature representation, enabling the model to capture high-level semantics and fine details. Finally, to enhance global perception and discriminative power, HC attention fuse local and global features which captured by the two ALST. Experiments on three remote sensing datasets demonstrate that ALHCT significantly improves classification accuracy, outperforming several state-of-the-art methods, validating its effectiveness in classifying complex remote sensing scenes.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70076","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143861455","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Research on Intelligent and High-Precision Structure-Recognition Methods for Field Geological Outcrop Images","authors":"Mingguang Diao, Kaixuan Liu, Shupeng Wang, Chuyan Zhang","doi":"10.1049/ipr2.70087","DOIUrl":"https://doi.org/10.1049/ipr2.70087","url":null,"abstract":"<p>The accurate recognition of geological structures in field outcrop images is critical for applications such as geological hazard analysis, seismic risk assessment, and urban geological planning. However, traditional manual interpretation of geological images is time-consuming, labor-intensive, and subjective, limiting its scalability and precision. To address this gap, this study proposes an intelligent, automated recognition method for field geological outcrop images based on deep learning techniques. The methodology integrates Fourier transform, Canny edge detection, and Mask R-CNN instance segmentation, enhanced with image normalization and data augmentation strategies such as grayscale conversion, Gaussian filtering, and rotation. A custom dataset comprising 4260 images was constructed and annotated using a hybrid approach involving edge detection and expert labeling. The proposed model, improved with PrRoI Pooling, outperforms conventional models such as YOLOv3, Faster R-CNN, and standard Mask R-CNN, achieving a mean average precision (mAP) of 90.77% in detecting fault, fold, and sausage-like geological structures. The results demonstrate the model's robustness, accuracy, and suitability for complex geological environments. This study not only advances the state-of-the-art in geological image recognition but also lays a foundation for future research into broader structural classification, multi-modal geological data integration, and real-time field deployment.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70087","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143857171","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Infrared and Visible Image Fusion Based on Autoencoder Network","authors":"Hongmei Wang, Xuanyu Lu, Zhuofan Wu, Ruolin Li, Jingyu Wang","doi":"10.1049/ipr2.70086","DOIUrl":"https://doi.org/10.1049/ipr2.70086","url":null,"abstract":"<p>To overcome the problems of texture information loss and insufficiently prominent targets in existing fusion networks, an information decomposition-based autoencoder fusion network for infrared and visible images is proposed in this paper. Two salient information encoders with unshared weights and two scene information encoders with shared weights are designed to extract different features from infrared and visible images, respectively. The constraint is added to the loss function in order to ensure the ability of the salient information encoders to extract representative features and the scene information encoder to extract the cross-modality feature. In addition, by introducing the pre-trained semantic segmentation networks to guide the network training and constructing a feature saliency-based fusion strategy, the ability of the fusion network is further enhanced to distinguish between targets and backgrounds. Extensive experiments are carried out on five datasets. Comparison experiments with state-of-the-art fusion networks and ablation experiments indicate that the proposed method can obtain fused images with richer and more comprehensive information and is more robust to challenging factors, such as strong and weak light smoke and fog environments. At the same time, the fused images by our proposed method are more beneficial for downstream tasks such as target detection.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70086","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143857172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-Directional Transformer Image Super-Resolution Network Based on Information Enhancement","authors":"RongGui Wang, Xu Chen, Juan Yang, LiXia Xue","doi":"10.1049/ipr2.70074","DOIUrl":"https://doi.org/10.1049/ipr2.70074","url":null,"abstract":"<p>With the advancement of deep learning, single-image super-resolution (SISR) has achieved significant progress. Recently, vision transformer-based super-resolution models have demonstrated remarkable performance; however, their high computational cost hinders their practical application. In this paper, we introduce a lightweight transformer-based super-resolution model termed information-enhanced efficient multi-directional transformer(IEMT). The model employs a dual-branch architecture that integrates the strengths of both convolutional neural network (CNN) and transformer networks. The proposed high-frequency extraction block (HEB) effectively captures high-frequency information from the enhanced image. Furthermore, a multi-directional attention mechanism is incorporated into the transformer branch to comprehensively learn latent features and details, thereby enhancing reconstruction quality. For attention computation, we propose a dynamic parameter-sharing mechanism that adaptively adjusts parameter sharing based on local image features, significantly reducing the model's parameter count. Experimental results demonstrate that the proposed IEMT achieves superior performance on five benchmark datasets, with a significantly reduced parameter count, computational complexity, and memory usage.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70074","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143857170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing Semantic Information Representation in Multi-View Geo-Localization through Dual-Branch Network with Feature Consistency Enhancement and Multi-Level Feature Mining","authors":"Yang Zheng, Qing Li, Jiangyun Li, Zhenghao Xi, Jie Liu","doi":"10.1049/ipr2.70071","DOIUrl":"https://doi.org/10.1049/ipr2.70071","url":null,"abstract":"<p>Metric learning is fundamental to multi-view geo-localization, as it aims to establish a distance metric that minimizes the feature space distance between similar data points while maximizing the separation between dissimilar ones. However, in Siamese networks employed for metric learning, individual branches may exhibit discrepancies in their interpretation of semantic information from input data, resulting in semantically inconsistent feature representations. To address this issue, a method is designed to enhance significant region consistency within multi-view spaces by integrating feature consistency enhancement (FCE) and multi-level feature mining (MLFM) techniques into a dual-branch network. The FCE method emphasizes critical components of the input data, ensuring feature consistency between the two branches. Additionally, the MLFM mechanism facilitates feature integration across multiple levels, thereby enabling a more comprehensive extraction of semantic information. This approach enhances semantic understanding and promotes feature consistency across branches. The proposed method achieves AP values of 82.38% for drone-to-satellite and 77.36% for satellite-to-drone image matching. Notably, the method maintains computational efficiency without significantly affecting inference time. Additionally, improvements are observed in R@1, R@5 and R@10 metrics. The experimental results show that integrating FCE and MLFM into the dual-branch network improves semantic representation and outperforms existing methods.</p>","PeriodicalId":56303,"journal":{"name":"IET Image Processing","volume":"19 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://onlinelibrary.wiley.com/doi/epdf/10.1049/ipr2.70071","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143853021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}