Adaptive sampling methods facilitate the determination of reliable dataset sizes for evidence-based modeling.

IF 3.9 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in bioinformatics Pub Date : 2025-09-04 eCollection Date: 2025-01-01 DOI:10.3389/fbinf.2025.1528515

Tim Breitenbach, Thomas Dandekar


Citations: 0

Abstract

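How can we be sure that there is sufficient data for our model, such that the predictions remain reliable on unseen data and the conclusions drawn from the fitted model would not vary significantly when using a different sample of the same size? We answer these and related questions through a systematic approach that examines the data size and the corresponding gains in accuracy. Assuming the sample data are drawn from a data pool with no data drift, the law of large numbers ensures that a model converges to its ground truth accuracy. Our approach provides a heuristic method for investigating the speed of convergence with respect to the size of the data sample. This relationship is estimated using sampling methods, which introduces variation in the convergence speed results across different runs. To stabilize the results, so that conclusions do not depend on the run, and to extract the most reliable information about convergence speed encoded in the available data, the presented method automatically determines a sufficient number of repetitions to reduce sampling deviations below a predefined threshold, thereby ensuring the reliability of conclusions about the required amount of data.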
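The idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the synthetic data pool, the toy threshold classifier, the function names, and the stopping rule (standard error of the mean accuracy below `tol`) are all assumptions chosen for illustration. For each sample size, subsamples are drawn repeatedly and the repetitions continue until the sampling deviation of the estimated accuracy falls below the threshold.

```python
import random
import statistics


def make_pool(size, rng):
    """Synthetic 1-D two-class data (assumption): class k is drawn from N(2k, 1)."""
    pool = []
    for _ in range(size):
        label = rng.randint(0, 1)
        pool.append((rng.gauss(2.0 * label, 1.0), label))
    return pool


def fit(sample):
    """Toy model (assumption): cut halfway between the two class means."""
    xs0 = [x for x, y in sample if y == 0]
    xs1 = [x for x, y in sample if y == 1]
    if not xs0 or not xs1:
        # Only one class was sampled: always predict that class.
        only = 1 if xs1 else 0
        return lambda x: only
    m0, m1 = statistics.fmean(xs0), statistics.fmean(xs1)
    cut = (m0 + m1) / 2.0
    upper = 1 if m1 >= m0 else 0
    return lambda x: upper if x > cut else 1 - upper


def accuracy(model, data):
    """Fraction of points in `data` that the model labels correctly."""
    return sum(model(x) == y for x, y in data) / len(data)


def adaptive_mean_accuracy(pool, test, n, tol, rng, min_reps=5, max_reps=2000):
    """Repeat subsampling at size n until the standard error of the
    mean accuracy drops below tol (hypothetical stopping rule)."""
    scores = []
    while len(scores) < max_reps:
        sample = rng.sample(pool, n)
        scores.append(accuracy(fit(sample), test))
        if len(scores) >= min_reps:
            sem = statistics.stdev(scores) / len(scores) ** 0.5
            if sem < tol:
                break
    return statistics.fmean(scores), len(scores)


rng = random.Random(0)
pool = make_pool(2000, rng)   # data available for fitting
test = make_pool(1000, rng)   # held-out data from the same distribution
curve = {
    n: adaptive_mean_accuracy(pool, test, n, tol=0.002, rng=rng)
    for n in (10, 40, 160, 640)
}
for n, (acc, reps) in curve.items():
    print(f"n={n:4d}  mean accuracy={acc:.3f}  repetitions={reps}")
```

Reading the resulting curve from small to large `n` shows how quickly the accuracy levels off, and the repetition count indicates how much resampling was needed at each size before the estimate became stable under this particular rule.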




Source journal

Frontiers in bioinformatics

CiteScore

2.60

Self-citation rate

0.00%

Publication count