LMEye: An Interactive Perception Network for Large Language Models

Impact Factor: 8.4 · CAS Tier 1, Computer Science · JCR Q1, COMPUTER SCIENCE, INFORMATION SYSTEMS
Yunxin Li;Baotian Hu;Xinyu Chen;Lin Ma;Yong Xu;Min Zhang
{"title":"LMEye: An Interactive Perception Network for Large Language Models","authors":"Yunxin Li;Baotian Hu;Xinyu Chen;Lin Ma;Yong Xu;Min Zhang","doi":"10.1109/TMM.2024.3428317","DOIUrl":null,"url":null,"abstract":"Current efficient approaches to building Multimodal Large Language Models (MLLMs) mainly incorporate visual information into LLMs with a simple visual mapping network such as a linear projection layer, a multilayer perceptron (MLP), or Q-former from BLIP-2. Such networks project the image feature once and do not consider the interaction between the image and the human inputs. Hence, the obtained visual information without being connected to human intention may be inadequate for LLMs to generate intention-following responses, which we refer to as static visual information. To alleviate this issue, our paper introduces LMEye, a human-like eye with a play-and-plug interactive perception network, designed to enable dynamic interaction between LLMs and external visual information. It can allow the LLM to request the desired visual information aligned with various human instructions, which we term dynamic visual information acquisition. Specifically, LMEye consists of a simple visual mapping network to provide the basic perception of an image for LLMs. It also contains additional modules responsible for acquiring requests from LLMs, performing request-based visual information seeking, and transmitting the resulting interacted visual information to LLMs, respectively. In this way, LLMs act to understand the human query, deliver the corresponding request to the request-based visual information interaction module, and generate the response based on the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal benchmarks, demonstrating that it significantly improves zero-shot performances on various multimodal tasks compared to previous methods, with fewer parameters. Moreover, we also verify its effectiveness and scalability on various language models and video understanding, respectively.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"26 ","pages":"10952-10964"},"PeriodicalIF":8.4000,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10598361/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Current efficient approaches to building Multimodal Large Language Models (MLLMs) mainly incorporate visual information into LLMs through a simple visual mapping network such as a linear projection layer, a multilayer perceptron (MLP), or the Q-Former from BLIP-2. Such networks project the image features only once and do not model the interaction between the image and the human inputs. Hence, the obtained visual information, disconnected from human intention, may be inadequate for LLMs to generate intention-following responses; we refer to this as static visual information. To alleviate this issue, this paper introduces LMEye, a human-like eye with a plug-and-play interactive perception network, designed to enable dynamic interaction between LLMs and external visual information. It allows the LLM to request the visual information required by various human instructions, which we term dynamic visual information acquisition. Specifically, LMEye consists of a simple visual mapping network that provides the basic perception of an image for LLMs, together with additional modules responsible for acquiring requests from LLMs, performing request-based visual information seeking, and transmitting the resulting interacted visual information back to LLMs. In this way, the LLM understands the human query, delivers the corresponding request to the request-based visual information interaction module, and generates a response based on the interleaved multimodal information. We evaluate LMEye through extensive experiments on multimodal benchmarks, demonstrating that it significantly improves zero-shot performance on various multimodal tasks compared with previous methods, while using fewer parameters. Moreover, we also verify its effectiveness and scalability on various language models and on video understanding, respectively.
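To make the described pipeline concrete, the following is a minimal PyTorch-style sketch of the request-based interaction: the LLM emits a "request" hidden state that queries the frozen image features, and the interacted visual feature is projected back into the LLM's embedding space. All module names, dimensions, and the use of cross-attention for "visual information seeking" are illustrative assumptions, not the paper's actual implementation.

```python
# Hedged sketch of an LMEye-style interactive perception network.
# Names, dimensions, and the cross-attention design are assumptions
# for illustration; they are not the paper's actual implementation.
import torch
import torch.nn as nn


class InteractivePerceptionNetwork(nn.Module):
    """Hypothetical plug-and-play module: a static path projects image
    features once, and a dynamic path lets the LLM's request attend
    over the image features to seek intention-aligned information."""

    def __init__(self, llm_dim: int = 4096, vis_dim: int = 1024, n_heads: int = 8):
        super().__init__()
        # Static path: project image features once into the LLM space.
        self.visual_mapping = nn.Linear(vis_dim, llm_dim)
        # Dynamic path: map the LLM's request into the visual space,
        # then attend over image features (request-based seeking).
        self.request_proj = nn.Linear(llm_dim, vis_dim)
        self.seek = nn.MultiheadAttention(vis_dim, n_heads, batch_first=True)
        # Transmit the interacted visual feature back to the LLM space.
        self.out_proj = nn.Linear(vis_dim, llm_dim)

    def forward(self, image_feats: torch.Tensor, request: torch.Tensor):
        # image_feats: (B, N_patches, vis_dim) from a frozen image encoder
        # request:     (B, llm_dim) hidden state of a special request token
        static_visual = self.visual_mapping(image_feats)        # (B, N, llm_dim)
        query = self.request_proj(request).unsqueeze(1)         # (B, 1, vis_dim)
        sought, _ = self.seek(query, image_feats, image_feats)  # (B, 1, vis_dim)
        dynamic_visual = self.out_proj(sought)                  # (B, 1, llm_dim)
        return static_visual, dynamic_visual


if __name__ == "__main__":
    ipn = InteractivePerceptionNetwork()
    img = torch.randn(2, 256, 1024)   # e.g. ViT patch features (assumed shape)
    req = torch.randn(2, 4096)        # LLM "request" hidden state
    static_v, dynamic_v = ipn(img, req)
    print(static_v.shape, dynamic_v.shape)  # (2, 256, 4096) (2, 1, 4096)
```

Under these assumptions, the static and dynamic visual features would be interleaved with the text embeddings before the LLM generates its response; the paper's actual seeking mechanism and token layout may differ.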
Source Journal
IEEE Transactions on Multimedia (Engineering & Technology – Telecommunications)
CiteScore: 11.70
Self-citation rate: 11.00%
Articles per year: 576
Average review time: 5.5 months
Journal description: The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.