Railway-CLIP: A multimodal model for abnormal object detection in high-speed railway

Jiayu Zhang , Qingji Guan , Junbo Liu , Yaping Huang , Jianyong Guo
High-speed Railway, Vol. 3, No. 3, pp. 194–204, September 2025. DOI: 10.1016/j.hspr.2025.06.001
https://www.sciencedirect.com/science/article/pii/S2949867825000388

Abstract

Automated detection of suspended anomalous objects on high-speed railway catenary systems using computer vision-based technology is a critical task for ensuring railway transportation safety. Despite the critical importance of this task, conventional vision-based foreign object detection methodologies have predominantly concentrated on image data, neglecting the exploration and integration of textual information. The currently popular multimodal model Contrastive Language-Image Pre-training (CLIP) employs contrastive learning to enable simultaneous understanding of both visual and textual modalities. Drawing inspiration from CLIP’s capabilities, this paper introduces a novel CLIP-based multimodal foreign object detection model tailored for railway applications, referred to as Railway-CLIP. This model leverages CLIP’s robust generalization capabilities to enhance performance in the context of catenary foreign object detection. The Railway-CLIP model is primarily composed of an image encoder and a text encoder. Initially, the Segment Anything Model (SAM) is employed to preprocess raw images, identifying candidate bounding boxes that may contain foreign objects. Both the original images and the detected candidate bounding boxes are subsequently fed into the image encoder to extract their respective visual features. In parallel, distinct prompt templates are crafted for both the original images and the candidate bounding boxes to serve as textual inputs. These prompts are then processed by the text encoder to derive textual features. The image and text encoders collaboratively project the multimodal features into a shared semantic space, facilitating the computation of similarity scores between visual and textual representations. The final detection results are determined based on these similarity scores, ensuring a robust and accurate identification of anomalous objects. 
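The matching step described above can be sketched in a few lines. The following is a minimal illustration of CLIP-style similarity scoring, not the authors' implementation: it assumes visual features for candidate boxes and textual features for two hypothetical prompt templates ("a photo of a normal catenary" vs. "a photo of a foreign object on a catenary") have already been extracted by the image and text encoders, and shows how cosine similarity in the shared space yields a per-box anomaly probability.

```python
import numpy as np

def cosine_similarity(a, b):
    # CLIP normalizes features to unit length before matching,
    # so the dot product equals the cosine similarity.
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Stand-ins for encoder outputs: 3 candidate boxes, 2 prompt templates.
rng = np.random.default_rng(0)
box_features = rng.normal(size=(3, 512))    # from the image encoder
text_features = rng.normal(size=(2, 512))   # from the text encoder

sims = cosine_similarity(box_features, text_features)   # shape (3, 2)

# A softmax over the two prompts (scaled by CLIP's learned logit scale,
# here fixed at 100 for illustration) gives a per-box anomaly probability.
logits = 100.0 * sims
logits -= logits.max(axis=1, keepdims=True)             # numerical stability
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)

anomaly_score = probs[:, 1]          # probability of the "foreign object" prompt
pred_is_anomaly = anomaly_score > 0.5
```

In a real pipeline the box features would come from cropping the SAM-proposed regions and passing them through the image encoder; here they are random placeholders so the sketch stays self-contained.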
Extensive experiments on our collected Railway Anomaly Dataset (RAD) demonstrate that the proposed Railway-CLIP outperforms previous state-of-the-art methods, achieving 97.25% AUROC and 92.66% F1-score, thereby validating the effectiveness and superiority of the proposed approach in real-world high-speed railway anomaly detection scenarios.
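For readers unfamiliar with the two reported metrics, the sketch below computes them from scratch on a toy example: AUROC via the rank-based (Mann–Whitney) formulation, and the F1-score from precision and recall at a fixed threshold. The labels and scores are illustrative, not from the RAD dataset.

```python
import numpy as np

def auroc(labels, scores):
    # Rank-based AUROC: the probability that a randomly chosen
    # positive receives a higher score than a randomly chosen negative.
    order = np.argsort(scores)
    ranks = np.empty(len(scores), dtype=float)
    ranks[order] = np.arange(1, len(scores) + 1)
    pos = labels == 1
    n_pos, n_neg = pos.sum(), (~pos).sum()
    return (ranks[pos].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def f1_score(labels, preds):
    tp = np.sum((preds == 1) & (labels == 1))
    fp = np.sum((preds == 1) & (labels == 0))
    fn = np.sum((preds == 0) & (labels == 1))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

# Toy anomaly scores: 1 = foreign object present, 0 = normal.
labels = np.array([0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.35, 0.8, 0.9])

print(round(auroc(labels, scores), 4))           # 0.8333 (5 of 6 pos/neg pairs ranked correctly)
print(round(f1_score(labels, scores > 0.5), 4))  # 0.8
```

AUROC is threshold-free, which is why anomaly-detection papers typically report it alongside a thresholded metric such as F1.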