TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models

IF 7.5 · CAS Tier 3 (Computer Science) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS
Wenqi Shao;Meng Lei;Yutao Hu;Peng Gao;Peng Xu;Kaipeng Zhang;Fanqing Meng;Siyuan Huang;Hongsheng Li;Yu Qiao;Ping Luo
{"title":"TinyLVLM-eHub:面向大型视觉语言模型的综合高效评估","authors":"Wenqi Shao;Meng Lei;Yutao Hu;Peng Gao;Peng Xu;Kaipeng Zhang;Fanqing Meng;Siyuan Huang;Hongsheng Li;Yu Qiao;Ping Luo","doi":"10.1109/TBDATA.2025.3536930","DOIUrl":null,"url":null,"abstract":"Large Vision-Language Models (LVLMs) have made significant strides in various multimodal tasks. Notably, GPT4V, Claude, Gemini, and others showcase exceptional multimodal capabilities, marked by profound comprehension and reasoning skills. This study introduces a comprehensive and efficient evaluation framework, TinyLVLM-eHub, to assess LVLMs’ performance, including proprietary models. TinyLVLM-eHub covers six key multimodal capabilities, such as visual perception, knowledge acquisition, reasoning, commonsense understanding, object hallucination, and embodied intelligence. The benchmark, utilizing 2.1K image-text pairs, provides a user-friendly and accessible platform for LVLM evaluation. The evaluation employs the ChatGPT Ensemble Evaluation (CEE) method, which improves alignment with human evaluation compared to word-matching approaches. Results reveal that closed-source API models like GPT4V and GeminiPro-V excel in most capabilities compared to previous open-source LVLMs, though they show some vulnerability in object hallucination. This evaluation underscores areas for LVLM improvement in real-world applications and serves as a foundational assessment for future multimodal advancements.","PeriodicalId":13106,"journal":{"name":"IEEE Transactions on Big Data","volume":"11 3","pages":"933-947"},"PeriodicalIF":7.5000,"publicationDate":"2025-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"TinyLVLM-eHub: Towards Comprehensive and Efficient Evaluation for Large Vision-Language Models\",\"authors\":\"Wenqi Shao;Meng Lei;Yutao Hu;Peng Gao;Peng Xu;Kaipeng Zhang;Fanqing Meng;Siyuan Huang;Hongsheng Li;Yu Qiao;Ping Luo\",\"doi\":\"10.1109/TBDATA.2025.3536930\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large Vision-Language Models (LVLMs) have made significant strides in various multimodal tasks. Notably, GPT4V, Claude, Gemini, and others showcase exceptional multimodal capabilities, marked by profound comprehension and reasoning skills. This study introduces a comprehensive and efficient evaluation framework, TinyLVLM-eHub, to assess LVLMs’ performance, including proprietary models. TinyLVLM-eHub covers six key multimodal capabilities, such as visual perception, knowledge acquisition, reasoning, commonsense understanding, object hallucination, and embodied intelligence. The benchmark, utilizing 2.1K image-text pairs, provides a user-friendly and accessible platform for LVLM evaluation. The evaluation employs the ChatGPT Ensemble Evaluation (CEE) method, which improves alignment with human evaluation compared to word-matching approaches. Results reveal that closed-source API models like GPT4V and GeminiPro-V excel in most capabilities compared to previous open-source LVLMs, though they show some vulnerability in object hallucination. 
This evaluation underscores areas for LVLM improvement in real-world applications and serves as a foundational assessment for future multimodal advancements.\",\"PeriodicalId\":13106,\"journal\":{\"name\":\"IEEE Transactions on Big Data\",\"volume\":\"11 3\",\"pages\":\"933-947\"},\"PeriodicalIF\":7.5000,\"publicationDate\":\"2025-01-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Big Data\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10858438/\",\"RegionNum\":3,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Big Data","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10858438/","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Large Vision-Language Models (LVLMs) have made significant strides in various multimodal tasks. Notably, GPT4V, Claude, Gemini, and others showcase exceptional multimodal capabilities, marked by profound comprehension and reasoning skills. This study introduces a comprehensive and efficient evaluation framework, TinyLVLM-eHub, to assess the performance of LVLMs, including proprietary models. TinyLVLM-eHub covers six key multimodal capabilities: visual perception, knowledge acquisition, reasoning, commonsense understanding, object hallucination, and embodied intelligence. The benchmark, built on 2.1K image-text pairs, provides a user-friendly and accessible platform for LVLM evaluation. The evaluation employs the ChatGPT Ensemble Evaluation (CEE) method, which aligns better with human judgment than word-matching approaches. Results reveal that closed-source API models such as GPT4V and GeminiPro-V outperform previous open-source LVLMs in most capabilities, though they remain somewhat vulnerable to object hallucination. This evaluation underscores areas for LVLM improvement in real-world applications and serves as a foundational assessment for future multimodal advancements.
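To make the contrast between word matching and an ensemble-of-judges protocol like CEE concrete, below is a minimal Python sketch. The judge prompts, the gpt-3.5-turbo model choice, and the majority-vote rule are illustrative assumptions, not the authors' actual configuration; it only assumes the openai>=1.0 SDK with OPENAI_API_KEY set in the environment.

# Minimal sketch (not the authors' code): grading an LVLM answer by
# (a) naive word matching and (b) a majority vote over several
# differently-worded ChatGPT "judge" prompts, in the spirit of CEE.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Hypothetical judge prompts; varying the wording reduces single-prompt bias.
JUDGE_PROMPTS = [
    ("Question: {q}\nReference answer: {ref}\nModel answer: {ans}\n"
     "Does the model answer agree with the reference? Reply YES or NO."),
    ("You are a strict grader. For the question '{q}', is '{ans}' "
     "semantically equivalent to the reference '{ref}'? Reply YES or NO."),
    ("Would a human accept '{ans}' as a correct answer to '{q}' given the "
     "reference '{ref}'? Reply YES or NO."),
]

def word_match(reference: str, answer: str) -> bool:
    """Naive baseline: accept only if the reference string literally appears."""
    return reference.lower() in answer.lower()

def cee_style_judge(question: str, reference: str, answer: str) -> bool:
    """Accept the answer if a majority of the prompt ensemble votes YES."""
    votes = 0
    for template in JUDGE_PROMPTS:
        reply = client.chat.completions.create(
            model="gpt-3.5-turbo",
            messages=[{
                "role": "user",
                "content": template.format(q=question, ref=reference, ans=answer),
            }],
            temperature=0.0,
        )
        text = (reply.choices[0].message.content or "").strip().upper()
        votes += text.startswith("YES")
    return votes > len(JUDGE_PROMPTS) / 2

The point of the sketch: word_match("a red double-decker bus", "The image shows a crimson two-level bus") returns False even though the answer is a correct paraphrase, whereas semantic judges can accept it, which is why an ensemble protocol aligns better with human evaluation.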
Source journal: IEEE Transactions on Big Data
CiteScore: 11.80
Self-citation rate: 2.80%
Articles published per year: 114
Journal description: The IEEE Transactions on Big Data publishes peer-reviewed articles focusing on big data. These articles present innovative research ideas and application results across disciplines, including novel theories, algorithms, and applications. Research areas cover a wide range, such as big data analytics, visualization, curation, management, semantics, infrastructure, standards, performance analysis, intelligence extraction, scientific discovery, security, privacy, and legal issues specific to big data. The journal also prioritizes applications of big data in fields generating massive datasets.