Visual Instruction Tuning towards General-Purpose Multimodal Large Language Model: A Survey

IF 9.3; Zone 2 (Computer Science); JCR Q1 (Computer Science, Artificial Intelligence)

International Journal of Computer Vision. Pub Date: 2025-08-30. DOI: 10.1007/s11263-025-02572-7

Jiaxing Huang, Jingyi Zhang, Kai Jiang, Han Qiu, Xiaoqin Zhang, Ling Shao, Shijian Lu, Dacheng Tao


Citations: 0

Abstract

Traditional computer vision generally solves each task independently with a specialist model, with the task instruction implicitly built into the model architecture. This leads to two constraints: (1) task-specific models, where each model is trained for one specific task, which hinders scalability and synergy across diverse tasks; and (2) pre-defined, fixed model interfaces that offer limited interactivity and adaptability in following users' task instructions. Visual Instruction Tuning (VIT), which learns from a wide range of vision tasks described by natural language instructions, has recently been studied intensively as a way to mitigate these constraints. It fine-tunes a large vision model using natural language as general task instructions, aiming for a general-purpose multimodal large language model (MLLM) that can follow various language instructions and potentially solve various user-specified vision tasks. This work provides a systematic and comprehensive review of visual instruction tuning covering six key aspects: (1) the background of the vision task paradigm and its development towards VIT; (2) the foundations of VIT, including commonly used network architectures, visual instruction tuning frameworks and objectives, and evaluation setups and tasks; (3) widely adopted benchmarks for visual instruction tuning and evaluation; (4) a thorough review of existing VIT techniques, categorized by both vision task and method design, highlighting their major contributions, strengths, and constraints; (5) a comparison and discussion of VIT methods on various instruction-following benchmarks; and (6) challenges, possible research directions, and future research topics in visual instruction tuning. A project associated with this work has been created at [link].




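Source journal

International Journal of Computer Vision (Engineering & Technology: Computer Science, Artificial Intelligence)

CiteScore: 29.80

Self-citation rate: 2.10%

Articles published: 163

Review time: 6 months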

About the journal: The International Journal of Computer Vision (IJCV) is a platform for sharing new research findings in the rapidly growing field of computer vision. It publishes 12 issues annually and presents high-quality, original contributions to the science and engineering of computer vision. The journal accepts several article types. Regular articles, up to 25 journal pages, report significant technical advances of broad interest to the field. Short articles, limited to 10 pages, offer a faster publication path for novel research outcomes. Survey articles, up to 30 pages, offer critical evaluations of the current state of the art or tutorial presentations of relevant topics. In addition to technical articles, the journal includes book reviews, position papers, and editorials by prominent scientific figures. Authors are encouraged to include supplementary material online, such as images, video sequences, data sets, and software, to improve the understanding and reproducibility of the published research.