{"title":"大语言模型与视觉感知的有效整合:训练范式视角的研究","authors":"Xiaorui Ma, Haoran Xie, S. Joe Qin","doi":"10.1016/j.inffus.2025.103419","DOIUrl":null,"url":null,"abstract":"<div><div>Integrating Large Language Models (LLMs) with visual modalities has become a central focus in multimodal AI. However, the high computational cost associated with Vision Large Language Models (VLLMs) limits their accessibility, restricting broader use across research communities and real-world deployments. Based on a comprehensive review of 36 high-quality image-text VLLMs, this survey categorizes vision integration into three training paradigms, each employing distinct approaches to improve parameter efficiency. Single-stage Tuning combines pretraining with few-shot learning and achieves strong generalization using minimal labeled data by training only the Modality Integrator (MI). Two-stage Tuning enhances performance through instruction tuning, multi-task learning, or reinforcement learning while improving efficiency via selective MI training, reparameterization modules, and lightweight LLMs. Direct Adaptation skips pretraining and directly finetunes the model on vision-language tasks, achieving efficiency by embedding lightweight MIs into frozen LLMs. These training paradigms have enabled practical applications in areas such as visual assistance, mobile device deployment, medical analysis, agricultural monitoring, and autonomous driving under resource constraints. Despite these advances, each paradigm faces distinct limitations: Single-stage Tuning struggles with few-shot transfer, Two-stage Tuning remains computationally expensive, and Direct Adaptation shows limited generalization ability. Correspondingly, future progress will require more effective pretraining strategies for better few-shot transfer in Single-stage Tuning, optimized use of lightweight LLMs in Two-stage Tuning, and broader adoption of instruction tuning in Direct Adaptation to improve generalization under resource constraints.</div></div>","PeriodicalId":50367,"journal":{"name":"Information Fusion","volume":"125 ","pages":"Article 103419"},"PeriodicalIF":14.7000,"publicationDate":"2025-06-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Efficiently Integrate Large Language Models with Visual Perception: A Survey from the Training Paradigm Perspective\",\"authors\":\"Xiaorui Ma, Haoran Xie, S. Joe Qin\",\"doi\":\"10.1016/j.inffus.2025.103419\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Integrating Large Language Models (LLMs) with visual modalities has become a central focus in multimodal AI. However, the high computational cost associated with Vision Large Language Models (VLLMs) limits their accessibility, restricting broader use across research communities and real-world deployments. Based on a comprehensive review of 36 high-quality image-text VLLMs, this survey categorizes vision integration into three training paradigms, each employing distinct approaches to improve parameter efficiency. Single-stage Tuning combines pretraining with few-shot learning and achieves strong generalization using minimal labeled data by training only the Modality Integrator (MI). Two-stage Tuning enhances performance through instruction tuning, multi-task learning, or reinforcement learning while improving efficiency via selective MI training, reparameterization modules, and lightweight LLMs. 
Direct Adaptation skips pretraining and directly finetunes the model on vision-language tasks, achieving efficiency by embedding lightweight MIs into frozen LLMs. These training paradigms have enabled practical applications in areas such as visual assistance, mobile device deployment, medical analysis, agricultural monitoring, and autonomous driving under resource constraints. Despite these advances, each paradigm faces distinct limitations: Single-stage Tuning struggles with few-shot transfer, Two-stage Tuning remains computationally expensive, and Direct Adaptation shows limited generalization ability. Correspondingly, future progress will require more effective pretraining strategies for better few-shot transfer in Single-stage Tuning, optimized use of lightweight LLMs in Two-stage Tuning, and broader adoption of instruction tuning in Direct Adaptation to improve generalization under resource constraints.</div></div>\",\"PeriodicalId\":50367,\"journal\":{\"name\":\"Information Fusion\",\"volume\":\"125 \",\"pages\":\"Article 103419\"},\"PeriodicalIF\":14.7000,\"publicationDate\":\"2025-06-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Information Fusion\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S1566253525004920\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Fusion","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1566253525004920","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Efficiently Integrate Large Language Models with Visual Perception: A Survey from the Training Paradigm Perspective
Integrating Large Language Models (LLMs) with visual modalities has become a central focus in multimodal AI. However, the high computational cost associated with Vision Large Language Models (VLLMs) limits their accessibility, restricting broader use across research communities and real-world deployments. Based on a comprehensive review of 36 high-quality image-text VLLMs, this survey categorizes vision integration into three training paradigms, each employing distinct approaches to improve parameter efficiency. Single-stage Tuning combines pretraining with few-shot learning and achieves strong generalization using minimal labeled data by training only the Modality Integrator (MI). Two-stage Tuning enhances performance through instruction tuning, multi-task learning, or reinforcement learning while improving efficiency via selective MI training, reparameterization modules, and lightweight LLMs. Direct Adaptation skips pretraining and directly finetunes the model on vision-language tasks, achieving efficiency by embedding lightweight MIs into frozen LLMs. These training paradigms have enabled practical applications in areas such as visual assistance, mobile device deployment, medical analysis, agricultural monitoring, and autonomous driving under resource constraints. Despite these advances, each paradigm faces distinct limitations: Single-stage Tuning struggles with few-shot transfer, Two-stage Tuning remains computationally expensive, and Direct Adaptation shows limited generalization ability. Correspondingly, future progress will require more effective pretraining strategies for better few-shot transfer in Single-stage Tuning, optimized use of lightweight LLMs in Two-stage Tuning, and broader adoption of instruction tuning in Direct Adaptation to improve generalization under resource constraints.
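To make the efficiency mechanism shared by these paradigms concrete, below is a minimal PyTorch sketch (not taken from the surveyed paper): the vision encoder and the LLM stay frozen, and only a small Modality Integrator, here a single linear projection, is trained to map visual features into the LLM's embedding space. The placeholder backbones, module names, and dimensions (1024-d vision features, 4096-d LLM embeddings) are illustrative assumptions, not the configurations of any specific surveyed model.

# Minimal sketch of parameter-efficient vision-LLM integration:
# freeze the backbones, train only a lightweight Modality Integrator (MI).
import torch
import torch.nn as nn

class ModalityIntegrator(nn.Module):
    """Projects frozen vision features into the LLM token-embedding space."""
    def __init__(self, vision_dim: int, llm_dim: int):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, vision_feats: torch.Tensor) -> torch.Tensor:
        # (batch, num_patches, vision_dim) -> (batch, num_patches, llm_dim)
        return self.proj(vision_feats)

# Placeholder frozen backbones; in practice these would be a pretrained
# vision encoder (e.g., a ViT) and a pretrained causal LLM.
vision_encoder = nn.Identity()
llm = nn.Identity()
for module in (vision_encoder, llm):
    for p in module.parameters():
        p.requires_grad = False  # frozen: no gradient updates

mi = ModalityIntegrator(vision_dim=1024, llm_dim=4096)

# Only the MI's parameters go to the optimizer, so the trainable
# footprint stays small regardless of the LLM's size.
optimizer = torch.optim.AdamW(mi.parameters(), lr=1e-4)

# Dummy forward pass: project patch features into "visual tokens" that
# would be prepended to the LLM's text embeddings.
dummy_patches = torch.randn(2, 256, 1024)
visual_tokens = mi(dummy_patches)
print(visual_tokens.shape)  # torch.Size([2, 256, 4096])

In this sketch the same frozen-backbone idea covers all three paradigms; they differ mainly in when the MI is trained (pretraining, a second instruction-tuning stage, or direct task fine-tuning) and in whether additional lightweight modules are added.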
About the journal:
Information Fusion serves as a central platform for showcasing advancements in multi-sensor, multi-source, multi-process information fusion, fostering collaboration among the diverse disciplines driving its progress. It is the leading outlet for sharing research and development in this field, focusing on architectures, algorithms, and applications. Papers presenting fundamental theoretical analyses, as well as those demonstrating their application to real-world problems, are welcome.