A novel and scalable multimodal large language model architecture Tool-MMGPT for future tool wear prediction in titanium alloy high-speed milling processes
IF 8.2 · Zone 1 (Computer Science) · JCR Q1: COMPUTER SCIENCE, INTERDISCIPLINARY APPLICATIONS
Caihua Hao , Zhaoyu Wang , Xinyong Mao , Songping He , Bin Li , Hongqi Liu , Fangyu Peng , Weiye Li
Computers in Industry, Volume 169, Article 104302. Published 2025-04-30. DOI: 10.1016/j.compind.2025.104302
https://www.sciencedirect.com/science/article/pii/S0166361525000673
Citations: 0
Abstract
Accurately predicting the future wear of cutting tools with variable geometric parameters remains a significant challenge. Existing methods lack the capability to model long-term temporal dependencies and predict future wear values—a key characteristic of world models. To address this challenge, we introduce the Tool-Multimodal Generative Pre-trained Transformer (Tool-MMGPT), a novel and scalable multimodal large language model (MLLM) architecture specifically designed for tool wear prediction. Tool-MMGPT pioneers the first tool wear world model by uniquely unifying multimodal data, extending beyond conventional static dimensions to incorporate dynamic temporal dimensions. This approach extracts modality-specific information and achieves shared spatiotemporal feature fusion through a cross-modal Transformer. Subsequently, alignment and joint interpretation occur within a unified representation space via a multimodal-language projector, which effectively accommodates the comprehensive input characteristics required by world models. This article proposes an effective cross-modal fusion module for vibration signals and images, aiming to fully leverage the advantages of multimodal information. Crucially, Tool-MMGPT transcends the limitations of traditional Large Language Models (LLMs) through an innovative yet generalizable method. By fundamentally reconstructing the output layer and redefining training objectives, we repurpose LLMs for numerical regression tasks, thereby establishing a novel bridge that connects textual representations to continuous numerical predictions. This enables the direct and accurate long-term forecasting of future wear time series. Extensive experiments conducted on a newly developed multimodal dataset for variable geometry tools demonstrate that Tool-MMGPT significantly outperforms state-of-the-art (SOTA) baseline methods. 
These results highlight the model's superior long-context modeling capabilities and illustrate its potential for effective deployment in environments with limited computational resources.
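The abstract describes repurposing an LLM for regression by rebuilding the output layer and changing the training objective. As a rough illustration of that general idea only, the sketch below swaps a token-logit output for a linear head that emits one continuous value per future time step and trains it with mean squared error. Everything here is an assumption made for illustration: the function names, the dimensions, the random stand-in for the fused hidden state, and the target values are hypothetical and are not taken from the paper.

```python
# Illustrative sketch only: names, sizes, and values are hypothetical,
# not taken from the Tool-MMGPT paper.
import random


def regression_head(hidden, weights, bias):
    """Map one hidden-state vector to a list of continuous wear values.

    Stands in for the vocabulary-sized output layer of a language model:
    one output per future time step instead of one logit per token.
    """
    return [
        sum(w * h for w, h in zip(row, hidden)) + b
        for row, b in zip(weights, bias)
    ]


def mse_loss(predicted, target):
    """Mean squared error, a regression objective in place of token cross-entropy."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


def sgd_step(hidden, weights, bias, target, lr=0.05):
    """One gradient-descent update of the head alone (backbone treated as frozen).

    Returns the loss measured before the update.
    """
    predicted = regression_head(hidden, weights, bias)
    n = len(target)
    for k, (p, t) in enumerate(zip(predicted, target)):
        grad = 2.0 * (p - t) / n  # d(loss) / d(prediction k)
        weights[k] = [w - lr * grad * h for w, h in zip(weights[k], hidden)]
        bias[k] -= lr * grad
    return mse_loss(predicted, target)


rng = random.Random(0)
HIDDEN_DIM, HORIZON = 8, 3  # hypothetical sizes
# Random stand-in for the fused multimodal hidden state.
hidden_state = [rng.uniform(-1, 1) for _ in range(HIDDEN_DIM)]
# Made-up future wear targets, for illustration only.
future_wear = [0.10, 0.12, 0.15]
W = [[0.0] * HIDDEN_DIM for _ in range(HORIZON)]
b = [0.0] * HORIZON

losses = [sgd_step(hidden_state, W, b, future_wear) for _ in range(200)]
final_prediction = regression_head(hidden_state, W, b)
```

Training only the head on a single example keeps the sketch short; it says nothing about how the actual model is trained or how its backbone is updated.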
About the journal:
The objective of Computers in Industry is to present original, high-quality, application-oriented research papers that:
• Illuminate emerging trends and possibilities in the utilization of Information and Communication Technology in industry;
• Establish connections or integrations across various technology domains within the expansive realm of computer applications for industry;
• Foster connections or integrations across diverse application areas of ICT in industry.