Title: GFIA: Generative Fault Image Analysis via vision–language model its application to train bogie transmission system
Authors: Chunming Zhang, Yu Wang, Xinge You
Journal: Journal of Visual Communication and Image Representation, vol. 111, Article 104482
Published: 2025-06-25
DOI: 10.1016/j.jvcir.2025.104482
URL: https://www.sciencedirect.com/science/article/pii/S1047320325000963
Impact factor: 3.1 (JCR Q2, Computer Science, Information Systems)
Citations: 0
Abstract
Multimedia fault analytics plays a critical role in industrial applications, ensuring safety and reliability. Previous studies have explored fault classification using either one-dimensional signals or two-dimensional images, but understanding fault types and providing appropriate responses remains challenging, especially for complex system failures. To advance this field, we leverage the reasoning and generative capabilities of Large Multimodal Models (LMMs) for fault analysis and transform multi-channel sensor signals from the system into structured grayscale images suitable for vision–language models. Additionally, we construct a domain-specific, strongly supervised dataset, the Bogie Transmission Unified Fault Dataset (BTU), which contains expert-curated fault types, causes, and solutions. By integrating the image and language modalities, we fine-tune a vision–language model, Generative Fault Image Analysis (GFIA), to enhance fault reasoning and interpretation. Extensive experiments on the BTU dataset show that GFIA achieves an average diagnostic accuracy exceeding 99.9% for motor faults, 100% for gearbox faults, and exceeding 99.8% for left axle box faults. The proposed GFIA model outperforms traditional deep-learning methods and state-of-the-art large language models, highlighting the effectiveness of vision–language integration for fault analysis.
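The abstract does not say how the multi-channel sensor signals are laid out as grayscale images, so the following is only a minimal sketch of one plausible encoding, not the authors' method. It assumes each channel is min-max normalized on its own and stacked as a horizontal band of pixel rows; the function name `signals_to_grayscale` and the parameters `width` and `rows_per_channel` are hypothetical.

```python
import math
from typing import List, Sequence


def signals_to_grayscale(
    signals: Sequence[Sequence[float]],
    width: int = 32,
    rows_per_channel: int = 8,
) -> List[List[int]]:
    """Pack multi-channel 1-D signals into one 8-bit grayscale image.

    Assumed layout (not taken from the paper): every channel contributes
    `rows_per_channel` consecutive rows of `width` pixels, and channels are
    stacked top to bottom in the order given.
    """
    needed = width * rows_per_channel
    image: List[List[int]] = []
    for channel in signals:
        if len(channel) < needed:
            raise ValueError(f"each channel needs at least {needed} samples")
        window = list(channel[:needed])
        lo, hi = min(window), max(window)
        # A constant channel has zero range; use 1.0 to avoid dividing by zero.
        span = (hi - lo) or 1.0
        # Per-channel min-max scaling to the 0..255 intensity range.
        pixels = [round(255 * (value - lo) / span) for value in window]
        # Cut the flat pixel list into rows of `width` pixels.
        for row in range(rows_per_channel):
            image.append(pixels[row * width:(row + 1) * width])
    return image


if __name__ == "__main__":
    n = 32 * 8
    ramp = [i / (n - 1) for i in range(n)]                    # synthetic channel
    sine = [math.sin(2 * math.pi * i / 64) for i in range(n)]  # synthetic channel
    image = signals_to_grayscale([ramp, sine])
    print(len(image), len(image[0]))  # image height and width in pixels
```

Per-channel normalization keeps a low-amplitude channel visible next to a high-amplitude one, at the cost of discarding relative amplitude between channels; a shared global range would be the alternative if that information matters for diagnosis.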
About the journal:
The Journal of Visual Communication and Image Representation publishes papers on state-of-the-art visual communication and image representation, with emphasis on novel technologies and theoretical work in this multidisciplinary area of pure and applied research. The field of visual communication and image representation is considered in its broadest sense and covers both digital and analog aspects as well as processing and communication in biological visual systems.