Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Peng Wang, Shuai Bai, Sinan Tan, Shijie Wang, Zhihao Fan, Jinze Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Yang Fan, Kai Dang, Mengfei Du, Xuancheng Ren, Rui Men, Dayiheng Liu, Chang Zhou, Jingren Zhou, Junyang Lin
arXiv:2409.12191 · arXiv - CS - Computer Vision and Pattern Recognition · Published 2024-09-18 · Citations: 0

Abstract

We present the Qwen2-VL Series, an advanced upgrade of the previous Qwen-VL models that redefines the conventional predetermined-resolution approach in visual processing. Qwen2-VL introduces the Naive Dynamic Resolution mechanism, which enables the model to dynamically process images of varying resolutions into different numbers of visual tokens. This approach allows the model to generate more efficient and accurate visual representations, closely aligning with human perceptual processes. The model also integrates Multimodal Rotary Position Embedding (M-RoPE), facilitating the effective fusion of positional information across text, images, and videos. We employ a unified paradigm for processing both images and videos, enhancing the model's visual perception capabilities. To explore the potential of large multimodal models, Qwen2-VL investigates the scaling laws for large vision-language models (LVLMs). By scaling both the model size (with versions at 2B, 8B, and 72B parameters) and the amount of training data, the Qwen2-VL Series achieves highly competitive performance. Notably, the Qwen2-VL-72B model achieves results comparable to leading models such as GPT-4o and Claude3.5-Sonnet across various multimodal benchmarks, outperforming other generalist models. Code is available at https://github.com/QwenLM/Qwen2-VL.
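The core of the Naive Dynamic Resolution idea — mapping an image's native resolution to a variable number of visual tokens rather than resizing every image to a fixed grid — can be sketched roughly as follows. The patch size of 14, the 2×2 token merge, and the function name are illustrative assumptions for a ViT-style patchify-then-merge scheme, not the paper's exact implementation:

```python
import math


def visual_token_count(height: int, width: int,
                       patch_size: int = 14, merge: int = 2) -> int:
    """Estimate how many visual tokens an image of a given native
    resolution would produce under a patchify-then-merge scheme.
    patch_size and merge are illustrative assumptions."""
    # Number of ViT patches along each axis (round up to cover the image).
    patches_h = math.ceil(height / patch_size)
    patches_w = math.ceil(width / patch_size)
    # Merging merge x merge neighboring patches into one visual token
    # shrinks the sequence length quadratically in the merge factor.
    tokens_h = math.ceil(patches_h / merge)
    tokens_w = math.ceil(patches_w / merge)
    return tokens_h * tokens_w


# A higher-resolution image simply yields more tokens than a smaller one,
# instead of both being squashed to the same predetermined grid.
small = visual_token_count(224, 224)    # 8 x 8 grid -> 64 tokens
large = visual_token_count(1344, 896)   # 48 x 32 grid -> 1536 tokens
```

Under this scheme the token budget tracks the image's information content, which is what lets the model trade compute for fidelity per image.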
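The M-RoPE idea — giving each token a decomposed (temporal, height, width) position instead of a single 1-D index — can be sketched in the same spirit. The exact axis offsets and the helper name here are assumptions for illustration only:

```python
def mrope_position_ids(text_len: int, grid_h: int, grid_w: int):
    """Sketch of multimodal 3-D position ids in the spirit of M-RoPE:
    every token gets a (temporal, height, width) triple. Offset details
    are illustrative assumptions, not the paper's exact scheme."""
    ids = []
    # Text tokens: the position advances identically on all three axes,
    # which degenerates to ordinary 1-D RoPE for pure-text input.
    for p in range(text_len):
        ids.append((p, p, p))
    # Image tokens: one shared temporal index, spatial indexes taken
    # from the token's (row, col) in the visual grid.
    t = text_len
    for row in range(grid_h):
        for col in range(grid_w):
            ids.append((t, t + row, t + col))
    return ids


# 3 text tokens followed by a 2 x 2 grid of visual tokens.
positions = mrope_position_ids(3, 2, 2)
```

Decomposing position this way lets one rotary scheme express 1-D order for text, 2-D layout for images, and (with the temporal axis advancing per frame) 3-D structure for video.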