{"title":"Enhancing multi-level cross-modal interaction with false negative-aware contrastive learning for text-video retrieval","authors":"Eungyeop Kim, Changhee Lee","doi":"10.1007/s10489-025-06821-7","DOIUrl":"10.1007/s10489-025-06821-7","url":null,"abstract":"<div><p>Text-video retrieval (TVR) has become a crucial branch in multi-modal understanding tasks. Enhanced by CLIP, a well-known contrastive learning framework that connects text and image, TVR has made substantial progress, particularly in developing cross-grained methods that account for both coarse and fine granularity in text and video. Nonetheless, previous cross-grained approaches have overlooked two crucial aspects. First, they utilize text-agnostic video summaries by simply averaging frame-level embeddings, potentially failing to capture crucial frame-level information that is semantically relevant to the corresponding text. Second, these approaches employ contrastive learning that neglects the impact of false negatives containing semantically relevant information. To address the aforementioned aspects, we introduce a novel framework for TVR, referred to as <i>X-MLNet</i>, focusing on capturing multi-level cross-modal interactions across video and text. This is done by first incorporating cross-attention modules at various levels of granularity, ranging from fine-grained (i.e., frame/word-level) representations to coarse-grained (i.e., video/sentence-level) representations. Then, we apply a contrastive learning framework that utilizes a similarity score computed based on the multi-level cross-modal interactions, excluding potential false negatives based on intra-modal connectivity among samples. Our experiments on five real-world benchmark datasets, including MSRVTT, MSVD, LSMDC, ActivityNet, and DiDeMo, demonstrate state-of-the-art performance in both text-to-video and video-to-text retrieval tasks. Our code is available at https://github.com/celestialxevermore/X-VLNet.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 14","pages":""},"PeriodicalIF":3.5,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144990538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A transfer learning-based fault diagnosis method for rolling bearings with variable operating conditions","authors":"Cunli Song, Xiaomeng Yuan","doi":"10.1007/s10489-025-06811-9","DOIUrl":"10.1007/s10489-025-06811-9","url":null,"abstract":"<div><p>Aiming at the problem that fault feature information cannot be completely extracted and it is difficult to obtain a large amount of sample data for fault labeling in real production life, we propose a transfer learning-based fault diagnosis method for rolling bearings with variable operating conditions. First, in order to make up for the single limitation of the feature extraction of the original vibration signal, a new feature signal is formed by fusing the time domain features on the basis of the original vibration signal, which is used as the input of the model, and a lightweight one-dimensional convolutional neural network(1d-CNN) is constructed, and an efficient channel attention mechanism is introduced to extract the fault features, so as to get the source domain diagnostic model. Then, according to the idea of transfer learning, the vibration signals under different working conditions are input into the fine-tuned model to realize the rolling bearing fault diagnosis under multiple working conditions. The results show that the method can realize migration under different working conditions and accurately and efficiently realize rolling bearing fault diagnosis.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 14","pages":""},"PeriodicalIF":3.5,"publicationDate":"2025-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144990539","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GAN-CLC-DGSR: Generative adversarial network framework with contrastive learning classifier for simultaneous time series data generation and state recognition","authors":"Weidong Wang, Yuxin Wu, Yang Song, Xuan Zhao, Yao Cui, Yuhan Fan, Yanbo Liu, Ziqi Lv","doi":"10.1007/s10489-025-06856-w","DOIUrl":"10.1007/s10489-025-06856-w","url":null,"abstract":"<div><p>Accurate identification of abnormal states is crucial for the continuous stable operation of equipment and timely intervention. However, the scarcity of abnormal data leads to low recognition accuracy in traditional methods when handling the data imbalance problem. To address this issue, we propose a novel Generative Adversarial Network Framework with Contrastive Learning Classifier for Simultaneous Time Series Data Generation and State Recognition (GAN-CLC-DGSR). In this framework, the generator not only synthesizes realistic signals but also enables the conversion between different signal categories. In addition to the conventional discriminator used to distinguish real from fake data, we design a contrastive learning-based classification discriminator. This discriminator maps the time-domain and frequency-domain features of the signal to a unified space, capturing invariant characteristics of the signal. This aids the generator in producing samples with higher category distinguishability. The classification discriminator is also trained as a state recognizer. We conduct extensive experiments on the vibration screen dataset from a coal preparation plant, a bearing dataset, and an epilepsy dataset. The results demonstrate that the proposed method outperforms other comparative methods in both data generation and state recognition, and it exhibits strong generalization capability.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 14","pages":""},"PeriodicalIF":3.5,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144923078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Crack segmentation network based on hybrid-window transformer and dual-branch fusion","authors":"Jianxin Wang, Jin Wang, Yi Li, Dongdong Ge, Siyuan Zhou","doi":"10.1007/s10489-025-06822-6","DOIUrl":"10.1007/s10489-025-06822-6","url":null,"abstract":"<div><p>Cracks are external manifestations of infrastructure damage, and routine inspections are crucial for assessing their structural safety. However, due to factors such as crack diversity, background noise interference, and information loss, high-precision crack segmentation still faces numerous challenges. To alleviate the influence of these factors, a crack segmentation network based on hybrid-window Transformer and dual-branch fusion (CSHD) is proposed. The CSHD network can effectively capture local texture details and global context modeling to achieve high-precision crack segmentation. First, a hybrid-window attention mechanism(HWA) is designed as the core component, which employs a dual-branch parallel architecture to integrate channel attention and multi-scale depth-wise convolution modules on the value path of window attention, achieving spatial receptive field expansion and cross-window feature interaction. Second, to enhance feature processing capabilities, a locally enhanced gated FeedForward network (LeGN) is proposed, which achieves adaptive feature aggregation through overlapping multi-scale deformable convolution, and a gated unit is designed to optimize the information flow. Thirdly, a dual-branch fusion module (DBF) is introduced in skip-connections of encoder-decoder layers to enhance cross-level feature interaction while effectively mitigating information loss during the downsampling process. Finally, comparative experimental results on three benchmark datasets (CrackLS315, DeepCrack537, and YCD776) with seven advanced networks demonstrate that the proposed network achieves excellent performance, obtaining mean intersection over union (mIoU) scores of 71.26%, 86.58%, and 83.93%, respectively. Code is available at: https://github.com/wjxcsust2024/CSHD.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 13","pages":""},"PeriodicalIF":3.5,"publicationDate":"2025-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144920460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"VAHMSE: an efficient anomaly detection model based on variational autoencoder and heterogeneous multi-stacking ensemble learning","authors":"Rui Wang, Jiayao Li","doi":"10.1007/s10489-025-06845-z","DOIUrl":"10.1007/s10489-025-06845-z","url":null,"abstract":"<div><p>With the advent of the information age, data has become an important resource and production factor. However, the existence of abnormal data causes the lose of personal privacy, business operations and national security, therefore, anomaly detection has received increasing attention in recent years. Most existing anomaly detection models are based on machine learning or deep learning models, but the use of a single model leads to the problems such as overfitting, weak generalization and poor stability. Meanwhile, there is a serious data imbalance problem due to the significantly few number of abnormal data compared to normal data, which reduces the detection performance. To address these issues, this paper proposes an anomaly detection model called VAHMSE based on <i>v</i>ariational <i>a</i>utoencoder and <i>h</i>eterogeneous <i>m</i>ulti-<i>s</i>tacking <i>e</i>nsemble learning to improve the detection performance. In the data augmentation phase, the <i>v</i>ariational <i>a</i>uto<i>e</i>ncoder (VAE) is used to replace traditional oversampling and other class balancing techniques to solve the data imbalance problem, and the mutual information is added to the loss function of traditional VAE to solve the problem of posterior distribution collapsing to prior distribution, thereby improving the quality of data generation. In the anomaly detection phase, the heterogeneous multi-stacking ensemble learning-based anomaly detection method is proposed, where five machine learning models with good performance are selected as the base learners in the first layer stacking process, and the TCN is selected as the meta learner in the second layer stacking process; In addition, the Squeeze and Excitation module is integrated into the traditional TCN model to explicitly model the interdependence between convolutional feature channels and improve the representation ability of network. Extensive experiments on six widely used datasets show that compared with five state-of-the-art models, the proposed VAHMSE achieves better performance in accuracy, recall, precision and F1-score, and it also achieves better stability.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 13","pages":""},"PeriodicalIF":3.5,"publicationDate":"2025-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144920477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hybrid CNN-RWKV with high-frequency enhancement for real-world chinese-english scene text image super-resolution","authors":"Yanbin Liu, Yu Zhu, Hangyu Li, Xiaofeng Ling","doi":"10.1007/s10489-025-06785-8","DOIUrl":"10.1007/s10489-025-06785-8","url":null,"abstract":"<div><p>Existing scene text image super-resolution (STISR) methods primarily focus on the restoration of fixed-size English text images. Compared to English characters, Chinese characters present a greater variety of categories and more intricate stroke structures. In recent years, Transformer-based methods have achieved significant progress in image super-resolution task, but face the dilemma between global modeling and efficient computation. The emerging Receptance Weighted Key Value (RWKV) model can serve as a promising alternative to Transformer, enabling long-distance modeling with linear computational complexity. In this paper, we propose a Hybrid CNN-RWKV with High-Frequency Enhancement (HCR-HFE) model for STISR task. First, we design a recurrent bidirectional WKV (Re-Bi-WKV) attention which integrates bidirectional WKV (Bi-WKV) attention with a recurrent mechanism. Bi-WKV achieves global receptive field with linear complexity, while the recurrent mechanism establishes 2D image dependencies from different scanning directions. Additionally, a computationally efficient high-frequency enhancement module (HFEM) is incorporated to enhance high-frequency details, such as character edge information. Furthermore, we design a multi-scale large kernel convolutional (MLKC) block which integrates large kernel decomposition, gated aggregation and multi-scale mechanism to capture various-range dependencies with reduced computational cost. Finally, we introduce a multi-frequency channel attention (MFCA) which extends channel attention to the frequency domain, enabling the model to focus on critical features. Extensive experiments on real-world Chinese-English (Real-CE) dataset demonstrate that HCR-HFE outperforms previous methods in both quantitative metrics and visual results. Furthermore, HCR-HFE achieves excellent performance on natural image datasets, demonstrating its broad applicability.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 13","pages":""},"PeriodicalIF":3.5,"publicationDate":"2025-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144920459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Educational knowledge graph based intelligent question answering for automatic control disciplines","authors":"Zhiwei Cai, Nuoying Xu, Linqin Cai, Bo Ren, Yu Xiong","doi":"10.1007/s10489-025-06847-x","DOIUrl":"10.1007/s10489-025-06847-x","url":null,"abstract":"<div><p>With the further development of education informatization, Educational Knowledge Graph (EKG) based intelligent Question Answering (KGQA) has attracted significant attention in smart education. However, current educational KGQA faces enormous challenges, such as the incomplete questions from students, the dispersed knowledge from EKG, and the scarce and imbalanced dataset. In this paper, a novel educational KGQA model was proposed for answering student’s questions on automatic control disciplines. Firstly, a topic entity detection algorithm was constructed based on BERT-BiLSTM-CRF and domain dictionary, and an intention recognition algorithm was built based on BERT and TextCNN to accurately locate the topic entity by formulating entity priority, entity completion rules, and similarity calculation. Then, a custom weighted cross-entropy loss function (CCL) was designed to alleviate the influence of imbalanced samples in the training dataset on the model classifier. In addition, the first Chinese dataset for educational KGQA in automatic control disciplines (ACKGQA) was constructed. Finally, extensive experiments are performed to evaluate the effectiveness and generalizations of the proposed KGQA model on the ACKGQA dataset and five benchmark public datasets. The proposed KGQA obtains the recognition precision of 87.5% and the recall of 86.25% on the ACKGQA dataset and exhibits better overall performance on other five benchmark datasets. Experimental results demonstrate that our educational KGQA model can achieve outstanding performance when facing the challenges posed by imbalanced datasets inherent in educational knowledge graphs.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 13","pages":""},"PeriodicalIF":3.5,"publicationDate":"2025-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144920461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balanced Loss Function for Long-tailed Semi-supervised Ship Detection","authors":"Li-Ying Hao, Jia-Rui Yang, Yunze Zhang","doi":"10.1007/s10489-025-06838-y","DOIUrl":"10.1007/s10489-025-06838-y","url":null,"abstract":"<div><p>Semi-supervised learning (SSL) has significantly reduced the reliance of the ship detection network on labeled images. However, the more realistic and challenging issue of long-tailed distribution in SSL remains largely unexplored. While most existing methods address this issue at the instance level through reweighting or resampling techniques, their performance is significantly limited by their dependence on biased backbone representations. To overcome this limitation, we propose a Balanced Loss function (Bal Loss). Our approach consists of three key components. First, we introduce the BaCon Loss, which computes class-wise feature centers as positive anchors and selects negative anchors through a simple yet effective mechanism. Second, we posit an assumption that the normalized features in contrastive learning follow a mixture of von Mises-Fisher (vMF) distributions in the unit space. This assumption allows us to estimate the distribution parameters using only the first sample moment, which can be efficiently computed in an online manner across different batches. Finally, we incorporate a Jitter-Bagging module, adapted from prior literature, to provide precise localization information, thereby refining bounding box predictions. Extensive experiments demonstrate the efficacy of Bal Loss, achieving SOTA results on ship datasets with a 3.9 improvement over the baseline. Notably, our method attains an <span>(AP^{r})</span> of 44.1 on the ShipRSImageNet dataset, underscoring its robust detection capabilities.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 13","pages":""},"PeriodicalIF":3.5,"publicationDate":"2025-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144920462","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MDT: A multiscale differencing transformer with sequence feature relationship mining for robust action recognition","authors":"Zengzhao Chen, Fumei Ma, Hai Liu, Wenkai Huang, Tingting Liu","doi":"10.1007/s10489-025-06861-z","DOIUrl":"10.1007/s10489-025-06861-z","url":null,"abstract":"<div><p>Skeleton-based action recognition, which analyzes joint coordinates and bone connections to classify human actions, is important in understanding and analyzing human dynamic behaviors. However, actions in complex scenes have a high degree of similarity and variability, with the dynamic changes in human skeletons and subtle temporal variations in particular posing significant challenges to the accuracy and robustness of action recognition systems. To mitigate these challenges, we propose a novel multiscale differencing transformer (MDT) with sequence feature relationship mining for robust action recognition. MDT effectively mines inter-frame timing information and feature distribution differences across multiple scales, enabling a deeper understanding of the nuances between actions. Specifically, we first propose multiscale differential self-attention to handle the need for understanding action changes across multiple time scales, improving the capacity of the model to effectively capture the global and local dynamic features of actions. Then, we introduce a sequence feature relationship mining module to address complex data patterns in scenes that may span multiple sequences, exhibiting both similar and distinct characteristics. By utilizing coarse- and fine-grained sequence information, this module empowers the model to recognize intricate data patterns. On the NTU RGB+D 60 dataset, the proposed MDT model outperforms the recent STAR-Transformer by 1.6% on the Cross-Subject (CS) setting and 1.1% on the Cross-View (CV) setting, demonstrating its consistent effectiveness across different evaluation protocols.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 13","pages":""},"PeriodicalIF":3.5,"publicationDate":"2025-08-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144920518","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ExQUAL: an explainable quantum machine learning classifier","authors":"Karuna Kadian, Sunita Garhwal, Ajay Kumar","doi":"10.1007/s10489-025-06732-7","DOIUrl":"10.1007/s10489-025-06732-7","url":null,"abstract":"<div><p>Quantum machine learning (QML) holds the potential to solve complex tasks that classical machine learning is unable to handle. QML is a promising and emerging field which is in the state of continuous development. This necessitates a deeper comprehension of the intricate black-box nature of the quantum machine learning models. To address this challenge, the incorporation of explainable artificial intelligence becomes imperative. This paper introduces a novel approach - Explainable Quantum Classifier (ExQUAL) to integrate the Local Interpretable Model-agnostic Explanations (LIME) framework and SHapley Additive exPlanations (SHAP) with the Pegasos Quantum Support Vector Machine (QSVM) model for classification tasks. ExQUAL provides a methodology to integrate these frameworks with both binary and multi-class classification tasks and provides both local and global explanations. This approach seeks to enhance transparency and interpretability while advancing the applicability and trustworthiness of quantum machine learning methodologies.</p></div>","PeriodicalId":8041,"journal":{"name":"Applied Intelligence","volume":"55 13","pages":""},"PeriodicalIF":3.5,"publicationDate":"2025-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144909751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}