POINTS: Improving Your Vision-language Model with Affordable Strategies

arXiv - CS - Multimedia. Pub Date: 2024-09-07. DOI: arxiv-2409.04828

Yuan Liu, Zhongyin Zhao, Ziyuan Zhuang, Le Tian, Xiao Zhou, Jie Zhou


Abstract

In recent years, vision-language models have made significant strides, excelling in tasks like optical character recognition and geometric problem-solving. However, several critical issues remain: 1) Proprietary models often lack transparency about their architectures, while open-source models need more detailed ablations of their training strategies. 2) Pre-training data in open-source work is under-explored, with datasets added empirically, making the process cumbersome. 3) Fine-tuning often focuses on adding datasets, leading to diminishing returns. To address these issues, we make the following contributions: 1) We trained a robust baseline model using the latest advancements in vision-language models, introducing effective improvements and conducting comprehensive ablation and validation for each technique. 2) Inspired by recent work on large language models, we filtered pre-training data using perplexity, selecting the lowest-perplexity data for training. This approach allowed us to train on a curated 1M dataset, achieving competitive performance. 3) During visual instruction tuning, we used model soup on models fine-tuned with different datasets when adding more datasets yielded only marginal improvements. These innovations resulted in a 9B-parameter model that performs competitively with state-of-the-art models. Our strategies are efficient and lightweight, making them easily adoptable by the community.
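
The two data and training strategies named in the abstract can be illustrated with a minimal sketch. The abstract gives no implementation details, so everything below is an assumption made for illustration: perplexity is taken as the exponential of the mean negative token log-probability, the per-token log-probabilities are assumed to come from some reference model that is not shown, and the soup is the simplest uniform weight average (the abstract does not say which model-soup variant the authors used). The function names `perplexity`, `select_lowest_perplexity`, and `uniform_soup` are hypothetical and are not taken from the paper.

```python
import math
from typing import Dict, List, Sequence


def perplexity(token_logprobs: Sequence[float]) -> float:
    """Perplexity as exp(-mean log-probability) over one sample's tokens.

    Assumption: natural-log probabilities produced by a reference model.
    """
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def select_lowest_perplexity(
    samples: Dict[str, Sequence[float]], keep_fraction: float
) -> List[str]:
    """Return ids of the keep_fraction of samples with the lowest perplexity.

    `samples` maps a sample id to that sample's token log-probabilities.
    `keep_fraction` is a hypothetical knob; the abstract only states that
    the lowest-perplexity data was selected.
    """
    if not 0.0 < keep_fraction <= 1.0:
        raise ValueError("keep_fraction must be in (0, 1]")
    ranked = sorted(samples, key=lambda sid: perplexity(samples[sid]))
    n_keep = max(1, int(len(ranked) * keep_fraction))
    return ranked[:n_keep]


def uniform_soup(
    checkpoints: Sequence[Dict[str, List[float]]]
) -> Dict[str, List[float]]:
    """Average parameters element-wise across checkpoints.

    Assumption: every checkpoint has the same parameter names and shapes,
    e.g. one model fine-tuned separately on different dataset mixtures.
    Parameters are flat lists of floats here to keep the sketch dependency-free.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    names = checkpoints[0].keys()
    if any(ckpt.keys() != names for ckpt in checkpoints):
        raise ValueError("checkpoints must share the same parameter names")
    n = len(checkpoints)
    return {
        name: [sum(vals) / n for vals in zip(*(ckpt[name] for ckpt in checkpoints))]
        for name in names
    }
```

In practice the log-probabilities would be computed by a language model over each pre-training sample, and the checkpoints would be framework tensors rather than Python lists; the ranking and averaging logic stays the same.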


