{"title":"A survey on multimodal large language models.","authors":"Shukang Yin, Chaoyou Fu, Sirui Zhao, Ke Li, Xing Sun, Tong Xu, Enhong Chen","doi":"10.1093/nsr/nwae403","DOIUrl":null,"url":null,"abstract":"<p><p>Recently, the multimodal large language model (MLLM) represented by GPT-4V has been a new rising research hotspot, which uses powerful large language models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of the MLLM, such as writing stories based on images and optical character recognition-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limit of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics about how MLLMs can be extended to support more granularity, modalities, languages and scenarios. We continue with multimodal hallucination and extended techniques, including multimodal in-context learning, multimodal chain of thought and LLM-aided visual reasoning. To conclude the paper, we discuss existing challenges and point out promising research directions.</p>","PeriodicalId":18842,"journal":{"name":"National Science Review","volume":"11 12","pages":"nwae403"},"PeriodicalIF":16.3000,"publicationDate":"2024-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11645129/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"National Science Review","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1093/nsr/nwae403","RegionNum":1,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/12/1 0:00:00","PubModel":"eCollection","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Abstract
Recently, the multimodal large language model (MLLM), represented by GPT-4V, has become a rising research hotspot; it uses powerful large language models (LLMs) as a brain to perform multimodal tasks. The surprising emergent capabilities of the MLLM, such as writing stories based on images and optical character recognition-free math reasoning, are rare in traditional multimodal methods, suggesting a potential path to artificial general intelligence. To this end, both academia and industry have endeavored to develop MLLMs that can compete with or even outperform GPT-4V, pushing the limits of research at a surprising speed. In this paper, we aim to trace and summarize the recent progress of MLLMs. First, we present the basic formulation of the MLLM and delineate its related concepts, including architecture, training strategy and data, as well as evaluation. Then, we introduce research topics on how MLLMs can be extended to support more granularity, modalities, languages and scenarios. We continue with multimodal hallucination and extended techniques, including multimodal in-context learning, multimodal chain of thought and LLM-aided visual reasoning. To conclude the paper, we discuss existing challenges and point out promising research directions.
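For readers unfamiliar with the basic formulation mentioned in the abstract, the sketch below illustrates the modular design that most current MLLMs share: a pretrained visual encoder, a lightweight connector (projector) and an LLM backbone that consumes projected visual features alongside text tokens. This is a minimal illustrative sketch only; all class names, dimensions and layer choices are assumptions for exposition, not the architecture or code of any specific model covered by the survey.

```python
# Illustrative-only sketch (not code from the paper): a toy version of the
# modular MLLM design typically discussed in such surveys, i.e. a pretrained
# visual encoder, a connector/projector, and an LLM backbone.
import torch
import torch.nn as nn

class ToyMLLM(nn.Module):
    def __init__(self, vis_dim=768, llm_dim=512, vocab_size=32000):
        super().__init__()
        # Stand-in for a pretrained vision encoder (e.g. a ViT) that turns
        # image patches into visual feature vectors.
        self.vision_encoder = nn.Linear(vis_dim, vis_dim)
        # Connector: projects visual features into the LLM embedding space.
        self.projector = nn.Linear(vis_dim, llm_dim)
        # Stand-in for a pretrained LLM backbone and its language-model head.
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model=llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, image_patches, text_tokens):
        vis = self.projector(self.vision_encoder(image_patches))  # (B, P, llm_dim)
        txt = self.text_embed(text_tokens)                        # (B, T, llm_dim)
        seq = torch.cat([vis, txt], dim=1)   # visual "tokens" prepended to text
        hidden = self.llm(seq)
        return self.lm_head(hidden)          # next-token logits over the vocabulary

# Usage: one image encoded as 16 patch features plus an 8-token text prompt.
model = ToyMLLM()
logits = model(torch.randn(1, 16, 768), torch.randint(0, 32000, (1, 8)))
print(logits.shape)  # torch.Size([1, 24, 32000])
```

In practice the encoder and LLM are pretrained and largely frozen, and mainly the connector (and optionally the LLM) is tuned during multimodal instruction training; the toy modules above simply stand in for those components to show how the pieces fit together.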
About the journal:
National Science Review (NSR; ISSN abbreviation: Natl. Sci. Rev.) is an English-language peer-reviewed multidisciplinary open-access scientific journal published by Oxford University Press under the auspices of the Chinese Academy of Sciences. According to Journal Citation Reports, its 2021 impact factor was 23.178.
National Science Review publishes review articles and perspectives, as well as original research in the form of brief communications and research articles.