HiLM-D: Enhancing MLLMs with Multi-scale High-Resolution Details for Autonomous Driving

IF 9.3 | Zone 2, Computer Science | Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

International Journal of Computer Vision | Pub Date: 2025-05-07 | DOI: 10.1007/s11263-025-02433-3

Xinpeng Ding, Jianhua Han, Hang Xu, Wei Zhang, Xiaomeng Li


Citations: 0

Abstract

Recent efforts to use natural language for interpretable driving focus mainly on planning and neglect perception tasks. In this paper, we address this gap by introducing ROLISP (Risk Object Localization and Intention and Suggestion Prediction), a task that targets interpretable risk object detection and motion suggestions for the ego car. Accurate ROLISP requires extensive reasoning to identify critical traffic objects and infer their intentions, which prompts us to explore the capabilities of multimodal large language models (MLLMs). However, the CLIP-ViT vision encoders in existing MLLMs have limited perception performance and struggle to capture essential visual information, e.g., high-resolution and multi-scale details and vision-related inductive biases, all of which are important for autonomous driving. To address these challenges, we introduce HiLM-D, a resource-efficient framework that enhances visual information processing in MLLMs for ROLISP. Our method is motivated by the observation that the primary variations in autonomous driving scenarios lie in motion trajectories rather than in the semantic or appearance information (e.g., shapes and colors) of objects. Hence, the visual processing of HiLM-D is a two-stream framework: (i) a temporal reasoning stream, which receives low-resolution dynamic video content to capture temporal semantics, and (ii) a spatial perception stream, which receives a single high-resolution frame to capture holistic visual perception-related information. The spatial perception stream is kept lightweight by a well-designed P-Adapter, which is training-efficient and easily integrated into existing MLLMs. Experiments on the DRAMA-ROLISP dataset show that HiLM-D significantly improves over current MLLMs, with a 3.7% improvement in BLEU-4 for captioning and an 8.7% improvement in mIoU for detection. Further tests on the Shikra-RD dataset confirm our method's generalization capabilities. The DRAMA-ROLISP dataset is available at https://github.com/xmed-lab/HiLM-D.
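To make the detection metric concrete, the following is a minimal sketch of how a mean IoU score over predicted and ground-truth risk object boxes could be computed. It is not the authors' evaluation code: the corner box format `(x1, y1, x2, y2)`, the one-prediction-per-sample pairing, and the names `box_iou` and `mean_iou` are assumptions made for illustration only.

```python
from typing import Sequence, Tuple

# Assumed box format: (x1, y1, x2, y2) with x1 <= x2 and y1 <= y2.
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Corners of the overlapping rectangle (may be empty).
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)

    area_a = max(0.0, a[2] - a[0]) * max(0.0, a[3] - a[1])
    area_b = max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])
    union = area_a + area_b - inter

    # Two degenerate (zero-area) boxes have no meaningful overlap.
    return inter / union if union > 0 else 0.0


def mean_iou(preds: Sequence[Box], targets: Sequence[Box]) -> float:
    """Average IoU over paired predictions and ground-truth boxes.

    Assumes exactly one predicted box and one ground-truth box per sample.
    """
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    return sum(box_iou(p, t) for p, t in zip(preds, targets)) / len(preds)


if __name__ == "__main__":
    predictions = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 1.0, 1.0)]
    ground_truth = [(0.0, 0.0, 2.0, 2.0), (2.0, 2.0, 3.0, 3.0)]
    print(mean_iou(predictions, ground_truth))
```

In this toy example the first pair overlaps perfectly and the second pair does not overlap at all, so the score is the average of 1 and 0. Whether the paper reports this number as an absolute or relative gain is not stated in the abstract.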


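Source journal

International Journal of Computer Vision (Engineering & Technology: Computer Science, Artificial Intelligence)

CiteScore: 29.80

Self-citation rate: 2.10%

Articles published: 163

Review time: 6 months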

About the journal: The International Journal of Computer Vision (IJCV) is a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal accepts several article types. Regular articles, which span up to 25 journal pages, focus on significant technical advancements of broad interest to the field. Short articles, limited to 10 pages, offer a swift publication path for novel research outcomes. Survey articles, of up to 30 pages, offer critical evaluations of the current state of the art in computer vision or tutorial presentations of relevant topics. In addition to technical articles, the journal includes book reviews, position papers, and editorials by prominent scientific figures, which complement the technical content. The journal encourages authors to include supplementary material online, such as images, video sequences, data sets, and software, to improve the understanding and reproducibility of the published research.