{"title":"Contrastive Language-Image Pretrained Models are Zero-Shot Human Scanpath Predictors","authors":"Dario Zanca;Andrea Zugarini;Simon Dietz;Thomas R. Altstidl;Mark A. Turban Ndjeuha;Moumita Chakraborty;Naga Venkata Sai Jitin Jami;Leo Schwinn;Bjoern M. Eskofier","doi":"10.1109/TAI.2025.3612905","DOIUrl":"https://doi.org/10.1109/TAI.2025.3612905","url":null,"abstract":"Understanding human attention mechanisms is crucial for advancing both vision science and artificial intelligence. While numerous computational models of free-viewing have been proposed, less is known about the mechanisms underlying task-driven image exploration. To address this gap, we introduce NevaClip, a novel zero-shot method for predicting visual scanpaths. NevaClip leverages contrastive language-image pretrained (CLIP) models in conjunction with human-inspired neural visual attention (NeVA) algorithms. By aligning the representation of foveated visual stimuli with associated captions, NevaClip uses gradient-driven visual exploration to generate scanpaths that simulate human attention. We also present CapMIT1003, a new dataset comprising captions and click-contingent image explorations collected from participants engaged in a captioning task. Based on the established MIT1003 benchmark, which includes eye-tracking data from free-viewing conditions, CapMIT1003 provides a valuable resource for studying human attention across both free-viewing and task-driven contexts. Additionally, we demonstrate NevaClip’s performance on the publicly available AiR-D dataset, which includes visual question answering (VQA) tasks. Experimental results show that NevaClip outperforms existing unsupervised computational models in scanpath plausibility across captioning, VQA, and free-viewing tasks. Furthermore, we demonstrate that NevaClip’s performance is sensitive to caption accuracy, with misleading captions leading to inaccurate scanpath behaviors. This underscores the importance of caption guidance in attention prediction and highlights NevaClip’s potential to advance our understanding of task-driven human attention mechanisms. Together, NevaClip and CapMIT1003 offer significant contributions to the field, providing new tools for studying and simulating human visual attention.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 4","pages":"2157-2170"},"PeriodicalIF":0.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147578984","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CIGformer: Leveraging Continuous Information Guidance in Transformer-Based Pansharpening","authors":"Hao Zhu;Yuan Wang;Xiaotong Li;Biao Hou;Bo Ren;Yifan Meng;Kefan Chen;Licheng Jiao","doi":"10.1109/TAI.2025.3614584","DOIUrl":"https://doi.org/10.1109/TAI.2025.3614584","url":null,"abstract":"Pansharpening has been a critical area of interest. However, current methods often fail to fully leverage the correlation between panchromatic (PAN) and multispectral (MS) images, resulting in spatial and spectral information loss. In addition, uniform processing across different scales can cause information interference. We introduce CIGformer, a multiscale fusion network utilizing a continuous information guidance mechanism to tackle these issues. Our approach introduces an intensity substitute block (ISB) to separate shared and unique features of PAN and MS, setting the stage for subsequent guidance. The core component, a guidance block (IGB) based on Transformer architecture, ensures adaptive retention of unique information while utilizing shared information. Furthermore, our multilevel encoder–decoder bidirectional pyramid structure minimizes multiscale information mixing, with IGB applied at each encoder level for optimal information use. A consistency loss function is also introduced to enhance training by assessing unique information retention. Our method significantly enhances the efficiency of using PAN and MS images through distinctive information guidance, as demonstrated by experiments on the GaoFen-1 (GF-1), GaoFen-2 (GF-2), and WorldView-3 (WV-3) datasets.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 4","pages":"2307-2320"},"PeriodicalIF":0.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147578988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Dynamic Belief Rule Learning for Classification With Expert in the Loop","authors":"Lianmeng Jiao;Xianggang Ma;Han Zhang;Jiawei Wu;Haonan Ma","doi":"10.1109/TAI.2025.3612317","DOIUrl":"https://doi.org/10.1109/TAI.2025.3612317","url":null,"abstract":"The belief rule-based classification system (BRBCS) has been explored as an effective and promising framework for designing classifiers, owing to its ability to create user-friendly linguistic models and handle different types of uncertainty. Nevertheless, current BRBCSs operate in a static manner, which restricts their use in dynamic classification environments. To this end, the article develops a dynamic BRBCS (DBRBCS) by dynamically learning belief rules with an expert in the loop. First, a compact and interpretable model of the initial belief rule is constructed using the existing labeled data. Then, an algorithm for updating the belief rule base is developed to modify the rule parameters or add/delete rules from the rule base based on the sequential data and the labels fed by the expert. The proposed method can effectively address dynamic classification tasks such as changes of feature distributions, emergence of new categories, and extinction of old categories. A case study of target classification and the comparative experiments with representative classification methods demonstrate the superiority of the proposed DBRBCS for various dynamic classification tasks.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 4","pages":"2142-2156"},"PeriodicalIF":0.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147579023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Empowering Traditional Ensemble Learning Through Feature Learning and Wavelet Transforms for Environmental Analysis","authors":"Pietro Manganelli Conforti;Pietro Nardelli;Andrea Fanti;Paolo Russo","doi":"10.1109/TAI.2025.3611909","DOIUrl":"https://doi.org/10.1109/TAI.2025.3611909","url":null,"abstract":"The rapid expansion of urbanization and industrial activities has significantly increased atmospheric pollutants, posing critical risks to environmental sustainability and public health. To mitigate this issue, innovative and accurate air quality forecasting tools are essential to enable effective pollution monitoring and management. This study presents SR-ViT-FEL, a novel deep-learning-based framework designed to enhance air quality forecasting by accurately predicting daily pollutant levels, such as carbon monoxide, by concurrently analyzing different environmental factors. The approach integrates time and frequency domain analyses via continuous wavelet transform and employs a novel ensemble learning strategy that integrates multilevel features extracted from both convolutional and transformer-based architectures. SR-ViT-FEL achieves superior predictive accuracy and adaptability when compared with various traditional monitoring settings. The findings indicate that SR-ViT-FEL not only improves predictive performance but also offers scalability for broader air quality monitoring applications, potentially reducing costs by accurately estimating multiple air quality parameters with fewer physical sensors.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 4","pages":"2112-2126"},"PeriodicalIF":0.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11174990","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147579024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Generalizable Meta-Deep Reinforcement Learning Algorithm for Multiobjective Traveling Salesman Problems","authors":"Xiaoyu Fu;Shenshen Gu;Chee-Meng Chew;Tengfei Li","doi":"10.1109/TAI.2025.3614210","DOIUrl":"https://doi.org/10.1109/TAI.2025.3614210","url":null,"abstract":"The multiobjective traveling salesman problem (MOTSP) is a representative class of multiobjective combinatorial optimization problems, with significant implications for both theoretical research and practical applications. Although deep reinforcement learning (DRL) has shown promise in solving MOTSPs, existing approaches often struggle with generalization to large-scale problem instances. To address this challenge, we propose a novel meta-deep reinforcement learning framework with preference-fused attention networks (MDRL-PFAN). This framework integrates a preference-fused mechanism to jointly encode problem instances and weight preferences into a unified feature space. Moreover, an ensemble meta-learning strategy is adopted to train the meta-model across tasks with varying scales, equipping MDRL-PFAN with robust solving and strong cross-scale generalization capabilities. During inference, a lightweight fine-tuning process on small-batch adaptation tasks is employed to further enhance optimization performance. Extensive experiments on diverse MOTSP instances demonstrate that MDRL-PFAN achieves superior performance compared to classic evolutionary algorithms and state-of-the-art DRL algorithms in terms of training efficiency, solution quality, and cross-scale generalization capability.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 4","pages":"2238-2252"},"PeriodicalIF":0.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147579025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Explainable and Position-Aware Learning in Digital Pathology","authors":"Milan Aryal;Nasim Yahya Soltani;Masoud Ganji","doi":"10.1109/TAI.2025.3613475","DOIUrl":"https://doi.org/10.1109/TAI.2025.3613475","url":null,"abstract":"Due to their gigapixel resolutions, whole slide images (WSIs) pose significant computational challenges when using traditional machine learning approaches. Representing WSIs as graphs is a promising solution, allowing the entire image to be processed effectively using graph-based learning. In this approach, WSIs are divided into smaller patches, each serving as a node in the graph, with edges representing relationships between different patches. However, existing graph learning methods primarily rely on message passing between neighboring nodes and often neglect the position of patches in WSIs. As a result, patches located in topologically similar neighborhoods may produce nearly indistinguishable embeddings, reducing model discriminability. To address this limitation, a graph-based framework is introduced for cancer classification in WSIs, incorporating positional embeddings through spline-based convolutional neural networks and graph attention mechanisms. This approach captures both structural and spatial context, enhancing classification accuracy. Evaluation on WSI datasets for prostate and kidney cancer grading demonstrates improved performance compared with other approaches. In addition to classification, model interpretability is emphasized. A gradient-based saliency mapping technique is employed to identify and visualize the regions within WSIs that contribute most to the diagnostic predictions, thereby enhancing the explainability of the proposed method and supporting clinical decision-making.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 4","pages":"2186-2195"},"PeriodicalIF":0.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147579038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"TransWaveNet: Transformer for Underwater Image Restoration with Wavelets","authors":"Priyanka Mishra;MD Raqib Khan;Shruti S. Phutke;Santosh Kumar Vipparthi;Subrahmanyam Murala","doi":"10.1109/TAI.2025.3613670","DOIUrl":"https://doi.org/10.1109/TAI.2025.3613670","url":null,"abstract":"Underwater image restoration (UIR) aims to improve the quality and visibility of images taken in underwater environments. These images find application in diverse fields like marine biology research, underwater archaeology, environmental monitoring, surveillance tasks, and offshore infrastructure inspection. However, the complexities of the underwater environment make these applications challenging, as light scattering and absorption cause blur, color cast, and reduced contrast in images. Although existing approaches show promising results in restoring degraded underwater images, their performance is limited under such complex and nonlinear degradation. In this research work, we propose a multidirectional wavelet coefficient space transformer model for underwater image deblurring and color restoration. Incorporating an attention mechanism within transformed spaces, our model dynamically adapts to underwater degradation. Additionally, we introduce a wavelet attention fusion transformer block (WAFTB) for attention computation in the wavelet coefficient space, along with an edge-preserving wavelet downsampling block (EPWDB) to retain fine details and textures during downsampling. A thorough assessment of our method on real-world (UCCS, U45, SQUID) and synthetic (UIEB, UCDD) datasets, along with profound ablation studies, validates its edge over existing techniques. Further, we have evaluated our method for tasks such as depth estimation, low-light enhancement, and deblurring, demonstrating its versatility and broad applicability across various image processing tasks.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 4","pages":"2196-2207"},"PeriodicalIF":0.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147578985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Securing IoT: Unveiling Attacks With Multiview-Multitask Learning","authors":"Urkhimbam Boby Clinton;Nazrul Hoque;Shahid Raza;Monowar Bhuyan","doi":"10.1109/TAI.2025.3615565","DOIUrl":"https://doi.org/10.1109/TAI.2025.3615565","url":null,"abstract":"With the rapid expansion and use of day-to-day Internet of Things (IoT) applications, cybercriminals are exploiting an increasingly wide attack surface and are capable of executing many successful attacks, leading to significant losses for individuals and organizations. The existing defense mechanisms often rely on single-view, single-task models that utilize a single feature set to perform one specific task. However, modern IoT systems are multifaceted, heterogeneous, and resource-constrained, posing considerable challenges for developing unified and scalable defense solutions. To address this, we propose a novel hybrid model called M<inline-formula><tex-math>${}^{2}$</tex-math></inline-formula>VT that integrates multiview learning (MVL) and multitask learning (MTL) for effective cyber-attack defense. The model simultaneously processes three distinct subsets of relevant features (views) to perform three interrelated tasks: attack detection, attack category classification, and attack type classification. The model leverages autoencoder (AE) and long short-term memory (LSTM) networks to extract task-specific spatial and temporal features, enhancing both efficiency and cost-effectiveness. The M<inline-formula><tex-math>${}^{2}$</tex-math></inline-formula>VT model is evaluated using four publicly available IoT benchmark datasets and one testbed dataset. Across all tasks and datasets, the model achieves over 96% accuracy, consistently outperforming state-of-the-art approaches. The parallel execution of MVL and MTL, combined with task-specific feature subsets, significantly boosts performance. The implementation of M<inline-formula><tex-math>${}^{2}$</tex-math></inline-formula>VT is publicly available in our code repository.<xref><sup>1</sup></xref><fn><label><sup>1</sup></label><p><uri>https://github.com/3Clinton/MVMTL</uri></p></fn>","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 4","pages":"2332-2345"},"PeriodicalIF":0.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147578987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Point Cloud Simplification Method Based on Frequency Domain Coding","authors":"Zhou Wu;Tianze Chen;Dongsheng Li;Jiepeng Liu;Hongxu Wang;Pengkun Liu;Yuancheng Qi","doi":"10.1109/TAI.2025.3613685","DOIUrl":"https://doi.org/10.1109/TAI.2025.3613685","url":null,"abstract":"With the development of intelligent construction, three-dimensional (3-D) point cloud data (PCD) simplification has played an important role in back-end scenario analysis, reducing the computational and storage burden. However, the existing simplification methods are not applicable to the modeling analysis tasks in the architecture, engineering and construction industry. To address this issue, this study proposes a new simplification method, named the frequency domain coding-based maximum difference simplification (FDCMDS). The FDCMDS is able to convert PCD into frequency domain multidimensional features to capture fine-grained structural variations of PCD. To improve the simplification efficiency, a 3-D Canny key point detection combined with PCD gradient is proposed as a key point extraction module. Finally, a method for evaluating PCD density is designed by combining existing metrics. The validation experiments on PCD with different density distributions and volumes prove the effectiveness and feasibility of the proposed method.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 4","pages":"2208-2224"},"PeriodicalIF":0.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147578996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"S-NODE: Explicit and Reversible Image Translation Encoding With Neural ODEs","authors":"Xu Wang;Dong Pang;Zhiyuan You;Xinping Guan;Xinyi Le","doi":"10.1109/TAI.2025.3619457","DOIUrl":"https://doi.org/10.1109/TAI.2025.3619457","url":null,"abstract":"Score-based diffusion models achieve high-quality data generation through an iterative denoising process. However, the stochastic term in the diffusion process prevents them from accomplishing reversible generative modeling. To tackle this problem, we present S-NODE, a novel generative model capable of reversible and conditional data generation. Unlike score-based models, S-NODE is an entirely deterministic generative method bridging the score function and neural ordinary differential equations (ODEs). First, we propose and prove an ODE utilizing a score-related difference as the drift term to model transformations between two given data distributions. Second, we suggest a path-constrained loss to reduce truncation errors, enhancing the model’s capabilities in generating high-quality samples. Third, S-NODE can use a single conditional model to generate and translate cross-class images in all stages without additional training. Extensive experiments on various tasks demonstrate the effectiveness and reversibility of our method. Compared with other ODE-based and score-based methods, S-NODE achieves superior performance (FID of 2.29 & IS of 9.96) on CIFAR-10 and facilitates reversible image translation and image interpolation on CelebA, MetFace, and AFHQ datasets.","PeriodicalId":73305,"journal":{"name":"IEEE transactions on artificial intelligence","volume":"7 4","pages":"2401-2411"},"PeriodicalIF":0.0,"publicationDate":"2026-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147579042","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}