Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement
Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee
arXiv:2409.11378 [cs.CL], published 2024-09-17
Abstract
Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes increasingly important. This work addresses the question: how can we determine the optimal subset of data for effective training? While existing research often emphasizes local criteria, such as instance quality, for subset selection, we argue that a global approach focused on data diversity is more critical. Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, reassessing each cluster's importance and sampling weight at every training iteration. This approach reduces the effect of outliers and automatically filters out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, and code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is available at https://github.com/for-ai/iterative-data-selection.
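
For concreteness, here is a minimal sketch of the diversity-first selection step described above, assuming precomputed instruction embeddings and a proportional per-cluster allocation. The function name, number of clusters, and allocation rule are illustrative assumptions, not the paper's exact implementation (see the linked repository for that):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_diverse_subset(embeddings: np.ndarray, budget: int,
                          k: int = 100, seed: int = 0) -> np.ndarray:
    """Cluster the instruction pool and sample from every cluster, so the
    selected subset covers the full dataset rather than one dense region."""
    km = KMeans(n_clusters=k, random_state=seed, n_init="auto").fit(embeddings)
    rng = np.random.default_rng(seed)
    selected = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # Proportional allocation (an assumption here): larger clusters
        # contribute more examples, but every cluster contributes at least one.
        n_take = max(1, round(budget * len(members) / len(embeddings)))
        selected.extend(rng.choice(members, size=min(n_take, len(members)),
                                   replace=False))
    # Rounding can overshoot the budget slightly; truncate to be safe.
    return np.array(selected[:budget])
```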
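A similarly hedged sketch of the iterative refinement loop's reweighting step, assuming a per-cluster quality score is available after each training iteration. The multiplicative update and the score placeholder are assumptions for illustration; the released code defines the actual reweighting criterion:

```python
import numpy as np

def update_cluster_weights(weights: np.ndarray, scores: np.ndarray,
                           temp: float = 1.0) -> np.ndarray:
    """Multiplicative update: clusters whose members help training gain
    sampling weight; low-scoring clusters decay toward zero weight."""
    weights = weights * np.exp(temp * scores)
    return weights / weights.sum()

def resample(cluster_members: list[np.ndarray], weights: np.ndarray,
             budget: int, rng: np.random.Generator) -> np.ndarray:
    """Draw the next training subset using the updated cluster weights."""
    per_cluster = np.floor(weights * budget).astype(int)
    picks = [rng.choice(m, size=min(n, len(m)), replace=False)
             for m, n in zip(cluster_members, per_cluster) if n > 0]
    return np.concatenate(picks) if picks else np.array([], dtype=int)
```

Under this kind of update, clusters whose members consistently hurt training see their weights shrink geometrically, which matches the abstract's claim that clusters containing low-quality data are filtered out automatically.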