Traffic scene perception via multimodal large language model with data augmentation and efficient training strategy

Impact factor: 7.2. Zone 1, Computer Science. JCR Q1: Computer Science, Artificial Intelligence

Applied Soft Computing. Pub Date: 2025-05-10. DOI: 10.1016/j.asoc.2025.113210

Shuo Liu, Lei Shi, Yucheng Shi, Yufei Gao, Xiaole Sun

Applied Soft Computing, volume 177, Article 113210. Journal Article. Publisher page: https://www.sciencedirect.com/science/article/pii/S1568494625005216

Citations: 0

Abstract

Intelligent mobility, driven by advancements in deep learning and computing power, enhances transportation efficiency and societal connectivity, fostering economic and urban development. Current computer vision solutions often struggle to capture the complex details or understand the context within traffic scenes, limiting advanced intelligent mobility and raising safety concerns. Multimodal Large Language Models (MLLMs), by integrating linguistic and visual data, can help vehicles and transportation systems gain a deeper understanding of real-world traffic scenes, offering solutions to current challenges. Nevertheless, existing approaches predominantly employ MLLMs as instruments for querying and engaging with traffic infrastructure, rather than empowering these models to genuinely comprehend the traffic environment. This limitation curtails the potential of MLLMs and may even pose safety risks. In this paper, we first introduce a data augmentation framework designed to transform raw data into datasets suited for specific training objectives, thereby addressing issues related to data scarcity. Second, we propose a learning rate-based staged training strategy that segments the training process into distinct stages. This strategy deploys datasets targeted at different training objectives according to the patterns of parameter changes observed in each stage, thereby enhancing the training efficiency of the model. Using these methods, we present InsightGPT, a model endowed with robust understanding and reasoning capabilities in traffic scenarios. In experiments conducted across six tasks, InsightGPT consistently outperforms baseline MLLMs in evaluating both overall traffic scenes and the individual objects within them, demonstrating its superior traffic comprehension and reasoning abilities. InsightGPT's parameters and deployment details are available at https://github.com/JinleLiu/InsightGPT.
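
The staged training idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the learning-rate schedule, the stage boundaries, or the per-stage datasets, so everything below is an assumption made for illustration: a linear-warmup plus cosine-decay schedule, three stages split by warmup and by half the peak learning rate, and hypothetical dataset names. It is not the authors' implementation.

```python
import math

# All constants below are hypothetical values chosen for illustration.
PEAK_LR = 2e-5
WARMUP_STEPS = 100
TOTAL_STEPS = 1000

# Hypothetical mapping from training stage to the dataset used in that stage.
STAGE_DATASETS = {
    "warmup": "scene_description",
    "high_lr": "object_level_qa",
    "low_lr": "reasoning_qa",
}


def lr_at(step: int) -> float:
    """Assumed schedule: linear warmup followed by cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, TOTAL_STEPS - WARMUP_STEPS)
    return 0.5 * PEAK_LR * (1.0 + math.cos(math.pi * progress))


def stage_for(step: int) -> str:
    """Assign a stage from the learning rate (assumed boundaries)."""
    if step < WARMUP_STEPS:
        return "warmup"
    if lr_at(step) >= 0.5 * PEAK_LR:
        return "high_lr"
    return "low_lr"


def dataset_for(step: int) -> str:
    """Pick the dataset to sample from at a given training step."""
    return STAGE_DATASETS[stage_for(step)]


if __name__ == "__main__":
    for step in (0, 300, 900):
        print(step, stage_for(step), dataset_for(step))
```

In a real training loop, `dataset_for(step)` would select which data loader feeds the next batch; how the paper actually ties datasets to observed parameter-change patterns is described in the full text, not in this sketch.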


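Source journal

Applied Soft Computing (Engineering & Technology - Computer Science: Interdisciplinary Applications)

CiteScore: 15.80

Self-citation rate: 6.90%

Articles published: 874

Review time: 10.9 months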

Journal description: Applied Soft Computing is an international journal promoting an integrated view of soft computing to solve real-life problems. The focus is to publish the highest quality research in application and convergence of the areas of Fuzzy Logic, Neural Networks, Evolutionary Computing, Rough Sets and other similar techniques to address real-world complexities. Applied Soft Computing is a rolling publication: articles are published as soon as the editor-in-chief has accepted them. Therefore, the web site is continuously updated with new articles and the publication time is short.