{"title":"Railway-CLIP: A multimodal model for abnormal object detection in high-speed railway","authors":"Jiayu Zhang , Qingji Guan , Junbo Liu , Yaping Huang , Jianyong Guo","doi":"10.1016/j.hspr.2025.06.001","DOIUrl":null,"url":null,"abstract":"<div><div>Automated detection of suspended anomalous objects on high-speed railway catenary systems using computer vision-based technology is a critical task for ensuring railway transportation safety. Despite the critical importance of this task, conventional vision-based foreign object detection methodologies have predominantly concentrated on image data, neglecting the exploration and integration of textual information. The currently popular multimodal model Contrastive Language-Image Pre-training (CLIP) employs contrastive learning to enable simultaneous understanding of both visual and textual modalities. Drawing inspiration from CLIP’s capabilities, this paper introduces a novel CLIP-based multimodal foreign object detection model tailored for railway applications, referred to as Railway-CLIP. This model leverages CLIP’s robust generalization capabilities to enhance performance in the context of catenary foreign object detection. The Railway-CLIP model is primarily composed of an image encoder and a text encoder. Initially, the Segment Anything Model (SAM) is employed to preprocess raw images, identifying candidate bounding boxes that may contain foreign objects. Both the original images and the detected candidate bounding boxes are subsequently fed into the image encoder to extract their respective visual features. In parallel, distinct prompt templates are crafted for both the original images and the candidate bounding boxes to serve as textual inputs. These prompts are then processed by the text encoder to derive textual features. The image and text encoders collaboratively project the multimodal features into a shared semantic space, facilitating the computation of similarity scores between visual and textual representations. The final detection results are determined based on these similarity scores, ensuring a robust and accurate identification of anomalous objects. Extensive experiments on our collected Railway Anomaly Dataset (RAD) demonstrate that the proposed Railway-CLIP outperforms previous state-of-the-art methods, achieving 97.25 % AUROC and 92.66 % <em>F</em><sub>1</sub>-score, thereby validating the effectiveness and superiority of the proposed approach in real-world high-speed railway anomaly detection scenarios.</div></div>","PeriodicalId":100607,"journal":{"name":"High-speed Railway","volume":"3 3","pages":"Pages 194-204"},"PeriodicalIF":0.0000,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"High-speed Railway","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2949867825000388","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Automated detection of suspended anomalous objects on high-speed railway catenary systems using computer vision is a critical task for ensuring railway transportation safety. Despite its importance, conventional vision-based foreign object detection methods have concentrated predominantly on image data, neglecting the exploration and integration of textual information. The widely adopted multimodal model Contrastive Language-Image Pre-training (CLIP) employs contrastive learning to understand visual and textual modalities jointly. Drawing on CLIP’s capabilities, this paper introduces a novel CLIP-based multimodal foreign object detection model tailored for railway applications, referred to as Railway-CLIP. The model leverages CLIP’s strong generalization to improve catenary foreign object detection. Railway-CLIP is composed primarily of an image encoder and a text encoder. First, the Segment Anything Model (SAM) preprocesses raw images to identify candidate bounding boxes that may contain foreign objects. Both the original images and the detected candidate bounding boxes are then fed into the image encoder to extract their respective visual features. In parallel, distinct prompt templates are crafted for the original images and the candidate bounding boxes to serve as textual inputs, which the text encoder processes to derive textual features. The two encoders project the multimodal features into a shared semantic space, enabling the computation of similarity scores between visual and textual representations. The final detection results are determined from these similarity scores, ensuring robust and accurate identification of anomalous objects. Extensive experiments on our collected Railway Anomaly Dataset (RAD) demonstrate that the proposed Railway-CLIP outperforms previous state-of-the-art methods, achieving 97.25% AUROC and a 92.66% F1-score, validating the effectiveness and superiority of the approach in real-world high-speed railway anomaly detection scenarios.
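To make the described pipeline concrete, the following is a minimal sketch of a CLIP-style anomaly-scoring loop over SAM region proposals. It assumes the publicly available OpenAI `clip` and Meta `segment_anything` packages as stand-ins for the paper’s encoders; the prompt texts, checkpoint path, and the max-over-regions fusion rule are illustrative assumptions, not the authors’ published configuration.

```python
# Illustrative sketch of a Railway-CLIP-style scoring pipeline.
# Assumptions (not from the paper): OpenAI's `clip` and Meta's
# `segment_anything` packages stand in for the paper's encoders;
# the prompts, checkpoint path, and score fusion are placeholders.
import clip
import numpy as np
import torch
from PIL import Image
from segment_anything import SamAutomaticMaskGenerator, sam_model_registry

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Step 1: SAM proposes candidate regions that may contain foreign objects.
sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth").to(device)
mask_generator = SamAutomaticMaskGenerator(sam)

# Step 2: hypothetical prompt templates for normal vs. anomalous content.
prompts = ["a photo of a normal railway catenary",
           "a photo of a foreign object suspended on a railway catenary"]
text_tokens = clip.tokenize(prompts).to(device)

def anomaly_score(image_path: str) -> float:
    """Return a score for how likely the image contains a foreign object."""
    image = Image.open(image_path).convert("RGB")
    # Candidate bounding boxes from SAM (XYWH format).
    masks = mask_generator.generate(np.array(image))
    crops = [image.crop((m["bbox"][0], m["bbox"][1],
                         m["bbox"][0] + m["bbox"][2],
                         m["bbox"][1] + m["bbox"][3])) for m in masks]
    # Step 3: encode the full image and every candidate crop.
    batch = torch.stack([preprocess(im) for im in [image] + crops]).to(device)
    with torch.no_grad():
        img_feats = model.encode_image(batch)
        txt_feats = model.encode_text(text_tokens)
    # Normalize so the dot product is a cosine similarity.
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
    # Step 4: similarity in the shared space -> softmax over the prompts.
    probs = (100.0 * img_feats @ txt_feats.T).softmax(dim=-1)
    # Flag the image by its most anomalous region (max over image + crops).
    return probs[:, 1].max().item()
```

Consistent with the abstract, a fuller implementation would use separate prompt templates for the full image and for the cropped candidate boxes; the sketch uses a single prompt pair for brevity.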