POINTS: Improving Your Vision-language Model with Affordable Strategies

Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou
{"title":"POINTS: Improving Your Vision-language Model with Affordable Strategies","authors":"Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou","doi":"arxiv-2409.04828","DOIUrl":null,"url":null,"abstract":"In recent years, vision-language models have made significant strides,\nexcelling in tasks like optical character recognition and geometric\nproblem-solving. However, several critical issues remain: 1) Proprietary models\noften lack transparency about their architectures, while open-source models\nneed more detailed ablations of their training strategies. 2) Pre-training data\nin open-source works is under-explored, with datasets added empirically, making\nthe process cumbersome. 3) Fine-tuning often focuses on adding datasets,\nleading to diminishing returns. To address these issues, we propose the\nfollowing contributions: 1) We trained a robust baseline model using the latest\nadvancements in vision-language models, introducing effective improvements and\nconducting comprehensive ablation and validation for each technique. 2)\nInspired by recent work on large language models, we filtered pre-training data\nusing perplexity, selecting the lowest perplexity data for training. This\napproach allowed us to train on a curated 1M dataset, achieving competitive\nperformance. 3) During visual instruction tuning, we used model soup on\ndifferent datasets when adding more datasets yielded marginal improvements.\nThese innovations resulted in a 9B parameter model that performs competitively\nwith state-of-the-art models. Our strategies are efficient and lightweight,\nmaking them easily adoptable by the community.","PeriodicalId":501480,"journal":{"name":"arXiv - CS - Multimedia","volume":"15 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Multimedia","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.04828","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source works is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we propose the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest perplexity data for training. This approach allowed us to train on a curated 1M dataset, achieving competitive performance. 3) During visual instruction tuning, we used model soup on different datasets when adding more datasets yielded marginal improvements. These innovations resulted in a 9B parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easily adoptable by the community.
要点:用经济实惠的策略改进您的视觉语言模式
近年来,视觉语言模型取得了长足进步,在光学字符识别和几何问题解决等任务中表现出色。然而,几个关键问题依然存在:1) 专有模型的架构往往缺乏透明度,而开源模型则需要更详细的训练策略说明。2)开源模型的预训练数据还未得到充分开发,数据集是根据经验添加的,这使得整个过程非常繁琐。3)微调往往集中在增加数据集上,导致收益递减。为了解决这些问题,我们提出了以下贡献:1)我们利用视觉语言模型的最新进展训练了一个稳健的基线模型,引入了有效的改进措施,并对每种技术进行了全面的消减和验证。2)受近期大型语言模型研究的启发,我们利用plexity过滤了预训练数据,选择plexity最低的数据进行训练。这种方法使我们能够在一个经过策划的 100 万数据集上进行训练,并取得了具有竞争力的性能。3) 在视觉指令调整过程中,当添加更多数据集只产生边际改进时,我们在不同数据集上使用了模型汤。我们的策略既高效又轻便,很容易被社区采用。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信