Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement
Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee
arXiv:2409.11378 [cs.CL], published 2024-09-17
Abstract
Finetuning large language models on instruction data is crucial for enhancing pre-trained knowledge and improving instruction-following capabilities. As instruction datasets proliferate, selecting optimal data for effective training becomes increasingly important. This work addresses the question: how can we determine the optimal subset of data for effective training? While existing research often emphasizes local criteria, such as instance quality, for subset selection, we argue that a global approach focused on data diversity is more critical. Our method employs k-means clustering to ensure the selected subset effectively represents the full dataset. We propose an iterative refinement method inspired by active learning techniques to resample instances from clusters, reassessing each cluster's importance and sampling weight at every training iteration. This approach reduces the effect of outliers and automatically filters out clusters containing low-quality data. Through extensive evaluation across natural language reasoning, general world knowledge, and code and math reasoning tasks, and by fine-tuning models from various families, we observe consistent improvements, achieving a 7% increase over random selection and a 3.8% improvement over state-of-the-art sampling methods. Our work highlights the significance of diversity-first sampling when finetuning LLMs to enhance performance across a broad array of evaluation tasks. Our code is available at https://github.com/for-ai/iterative-data-selection.
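
For concreteness, here is a minimal sketch of the diversity-first selection step described above, assuming precomputed instruction embeddings and a proportional per-cluster allocation. The function name, number of clusters, and allocation rule are illustrative assumptions, not the paper's exact implementation (see the linked repository for that):

```python
import numpy as np
from sklearn.cluster import KMeans

def select_diverse_subset(embeddings: np.ndarray, budget: int,
                          k: int = 100, seed: int = 0) -> np.ndarray:
    """Cluster the instruction pool and sample from every cluster, so the
    selected subset covers the full dataset rather than one dense region."""
    km = KMeans(n_clusters=k, random_state=seed, n_init="auto").fit(embeddings)
    rng = np.random.default_rng(seed)
    selected = []
    for c in range(k):
        members = np.where(km.labels_ == c)[0]
        # Proportional allocation (an assumption here): larger clusters
        # contribute more examples, but every cluster contributes at least one.
        n_take = max(1, round(budget * len(members) / len(embeddings)))
        selected.extend(rng.choice(members, size=min(n_take, len(members)),
                                   replace=False))
    # Rounding can overshoot the budget slightly; truncate to be safe.
    return np.array(selected[:budget])
```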
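A similarly hedged sketch of the iterative refinement loop's reweighting step, assuming a per-cluster quality score is available after each training iteration. The multiplicative update and the score placeholder are assumptions for illustration; the released code defines the actual reweighting criterion:

```python
import numpy as np

def update_cluster_weights(weights: np.ndarray, scores: np.ndarray,
                           temp: float = 1.0) -> np.ndarray:
    """Multiplicative update: clusters whose members help training gain
    sampling weight; low-scoring clusters decay toward zero weight."""
    weights = weights * np.exp(temp * scores)
    return weights / weights.sum()

def resample(cluster_members: list[np.ndarray], weights: np.ndarray,
             budget: int, rng: np.random.Generator) -> np.ndarray:
    """Draw the next training subset using the updated cluster weights."""
    per_cluster = np.floor(weights * budget).astype(int)
    picks = [rng.choice(m, size=min(n, len(m)), replace=False)
             for m, n in zip(cluster_members, per_cluster) if n > 0]
    return np.concatenate(picks) if picks else np.array([], dtype=int)
```

Under this kind of update, clusters whose members consistently hurt training see their weights shrink geometrically, which matches the abstract's claim that clusters containing low-quality data are filtered out automatically.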