Improving Pretraining Data Using Perplexity Correlations

arXiv - STAT - Machine Learning | Pub Date: 2024-09-09 | DOI: arxiv-2409.05816

Tristan Thrush, Christopher Potts, Tatsunori Hashimoto

Citations: 0

Abstract

Quality pretraining data is often seen as the key to high-performance language models. However, progress in understanding pretraining data has been slow due to the costly pretraining runs required for data selection experiments. We present a framework that avoids these costs and selects high-quality pretraining data without any LLM training of our own. Our work is based on a simple observation: LLM losses on many pretraining texts are correlated with downstream benchmark performance, and selecting high-correlation documents is an effective pretraining data selection method. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations and perform data selection using a sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of thousands of web domains. In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM, a hand-engineered bigram classifier.
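The selection idea in the abstract can be illustrated with a minimal sketch. It is not the authors' estimator: it assumes a plain Spearman rank correlation between each domain's negated loss and a single benchmark score, followed by a top-k cut, and it runs on synthetic data. The names `domain_scores` and `select_domains`, the array shapes, and the noise levels are all hypothetical choices made for illustration.

```python
import numpy as np

def rank(a, axis=0):
    # Ordinal ranks along `axis`; ties are broken by position,
    # which is adequate for continuous-valued inputs.
    return np.argsort(np.argsort(a, axis=axis), axis=axis).astype(float)

def domain_scores(losses, bench):
    """Spearman correlation between negated loss and benchmark score.

    losses: (n_models, n_domains) loss of each model on each domain.
    bench:  (n_models,) benchmark score of each model (higher is better).
    Returns one correlation per domain, in [-1, 1].
    """
    x = rank(-losses, axis=0)          # lower loss -> higher rank
    y = rank(bench)
    x = x - x.mean(axis=0)
    y = y - y.mean()
    denom = np.sqrt((x ** 2).sum(axis=0) * (y ** 2).sum())
    return (x * y[:, None]).sum(axis=0) / denom

def select_domains(losses, bench, k):
    # Keep the k domains whose loss best tracks benchmark performance.
    scores = domain_scores(losses, bench)
    return np.argsort(scores)[::-1][:k]

# Synthetic demo (assumption: 90 models, 6 domains, made-up numbers).
rng = np.random.default_rng(0)
n_models, n_domains = 90, 6
ability = rng.normal(size=n_models)                   # latent model quality
bench = ability + 0.1 * rng.normal(size=n_models)     # benchmark score
losses = rng.normal(size=(n_models, n_domains))       # uninformative domains
losses[:, 0] = -ability + 0.1 * rng.normal(size=n_models)  # informative
losses[:, 1] = -ability + 0.1 * rng.normal(size=n_models)  # informative

top = select_domains(losses, bench, k=2)
print(sorted(top.tolist()))
```

In this toy setup only domains 0 and 1 have losses tied to the latent model quality, so they are the ones the correlation ranking should surface. A real pipeline would need per-domain losses measured on actual LLMs and a principled way to aggregate several benchmarks, neither of which is specified in the abstract.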


