Imp：移动设备的高性能大型多模态模型

IF 9.7 1区计算机科学 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

IEEE Transactions on Multimedia Pub Date : 2025-04-11 DOI:10.1109/TMM.2025.3557680

Zhenwei Shao;Zhou Yu;Jun Yu;Xuecheng Ouyang;Lihao Zheng;Zhenbiao Gai;Mingyang Wang;Zhenzhong Kuang;Jiajun Ding

{"title":"Imp：移动设备的高性能大型多模态模型","authors":"Zhenwei Shao;Zhou Yu;Jun Yu;Xuecheng Ouyang;Lihao Zheng;Zhenbiao Gai;Mingyang Wang;Zhenzhong Kuang;Jiajun Ding","doi":"10.1109/TMM.2025.3557680","DOIUrl":null,"url":null,"abstract":"By harnessing the capabilities of large language models (LLMs), recent large multimodal models (LMMs) have shown remarkable versatility in open-world multimodal understanding. Nevertheless, they are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios. To this end, several lightweight LMMs have been proposed successively to maximize the capabilities under constrained scale (e.g., 3B). Despite the encouraging results achieved by these methods, most of them only focus on one or two aspects of the design space, and the key design choices that influence model capability have not yet been thoroughly investigated. In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data. Based on our findings, we obtain Imp—a family of highly capable LMMs at the 2B<inline-formula><tex-math>$\\sim$</tex-math></inline-formula>4B scales. Notably, our Imp-3B model steadily outperforms all the existing lightweight LMMs of similar size, and even surpasses the state-of-the-art LMMs at the 13B scale. With low-bit quantization and resolution reduction techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile chip with a high inference speed of about 13 tokens/s.","PeriodicalId":13273,"journal":{"name":"IEEE Transactions on Multimedia","volume":"27 ","pages":"2961-2974"},"PeriodicalIF":9.7000,"publicationDate":"2025-04-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Imp: Highly Capable Large Multimodal Models for Mobile Devices\",\"authors\":\"Zhenwei Shao;Zhou Yu;Jun Yu;Xuecheng Ouyang;Lihao Zheng;Zhenbiao Gai;Mingyang Wang;Zhenzhong Kuang;Jiajun Ding\",\"doi\":\"10.1109/TMM.2025.3557680\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"By harnessing the capabilities of large language models (LLMs), recent large multimodal models (LMMs) have shown remarkable versatility in open-world multimodal understanding. Nevertheless, they are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios. To this end, several lightweight LMMs have been proposed successively to maximize the capabilities under constrained scale (e.g., 3B). Despite the encouraging results achieved by these methods, most of them only focus on one or two aspects of the design space, and the key design choices that influence model capability have not yet been thoroughly investigated. In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data. Based on our findings, we obtain Imp—a family of highly capable LMMs at the 2B<inline-formula><tex-math>$\\\\sim$</tex-math></inline-formula>4B scales. Notably, our Imp-3B model steadily outperforms all the existing lightweight LMMs of similar size, and even surpasses the state-of-the-art LMMs at the 13B scale. With low-bit quantization and resolution reduction techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile chip with a high inference speed of about 13 tokens/s.\",\"PeriodicalId\":13273,\"journal\":{\"name\":\"IEEE Transactions on Multimedia\",\"volume\":\"27 \",\"pages\":\"2961-2974\"},\"PeriodicalIF\":9.7000,\"publicationDate\":\"2025-04-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Multimedia\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10962548/\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multimedia","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10962548/","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

摘要

通过利用大型语言模型（llm）的能力，最近的大型多模态模型（lmm）在开放世界的多模态理解中显示出显著的多功能性。然而，它们通常是大量参数和计算密集型的，因此阻碍了它们在资源受限场景中的适用性。为此，相继提出了几种轻量级lmm，以最大限度地发挥受限制规模下的能力（例如3B）。尽管这些方法取得了令人鼓舞的结果，但它们大多只关注设计空间的一个或两个方面，并且尚未对影响模型能力的关键设计选择进行深入研究。本文从模型架构、训练策略、训练数据等方面对轻量级lmm进行了系统的研究。基于我们的研究结果，我们获得了一个在2B$\sim$4B尺度上的高性能lmm家族。值得注意的是，我们的Imp-3B模型稳定地优于所有现有的类似尺寸的轻型lmm，甚至超过了最先进的13B级lmm。通过低比特量化和分辨率降低技术，我们的Imp模型可以部署在高通骁龙8Gen3移动芯片上，具有大约13个令牌/秒的高推断速度。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Imp: Highly Capable Large Multimodal Models for Mobile Devices

By harnessing the capabilities of large language models (LLMs), recent large multimodal models (LMMs) have shown remarkable versatility in open-world multimodal understanding. Nevertheless, they are usually parameter-heavy and computation-intensive, thus hindering their applicability in resource-constrained scenarios. To this end, several lightweight LMMs have been proposed successively to maximize the capabilities under constrained scale (e.g., 3B). Despite the encouraging results achieved by these methods, most of them only focus on one or two aspects of the design space, and the key design choices that influence model capability have not yet been thoroughly investigated. In this paper, we conduct a systematic study for lightweight LMMs from the aspects of model architecture, training strategy, and training data. Based on our findings, we obtain Imp—a family of highly capable LMMs at the 2B

$\sim$

4B scales. Notably, our Imp-3B model steadily outperforms all the existing lightweight LMMs of similar size, and even surpasses the state-of-the-art LMMs at the 13B scale. With low-bit quantization and resolution reduction techniques, our Imp model can be deployed on a Qualcomm Snapdragon 8Gen3 mobile chip with a high inference speed of about 13 tokens/s.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

IEEE Transactions on Multimedia 工程技术-电信学

CiteScore

11.70

自引率

11.00%

发文量

576

审稿时长

5.5 months

期刊介绍： The IEEE Transactions on Multimedia delves into diverse aspects of multimedia technology and applications, covering circuits, networking, signal processing, systems, software, and systems integration. The scope aligns with the Fields of Interest of the sponsors, ensuring a comprehensive exploration of research in multimedia.