A Probabilistic Method to Predict Classifier Accuracy on Larger Datasets given Small Pilot Data.

Proceedings of machine learning research Pub Date : 2023-12-01

Ethan Harvey, Wansu Chen, David M Kent, Michael C Hughes

Volume 225, pages 129-144. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11826957/pdf/

Citations: 0

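Abstract

Practitioners building classifiers often start with a smaller pilot dataset and plan to grow to larger data in the near future. Such projects need a toolkit for extrapolating how much classifier accuracy may improve from a 2x, 10x, or 50x increase in data size. While existing work has focused on finding a single "best-fit" curve using various functional forms like power laws, we argue that modeling and assessing the uncertainty of predictions is critical yet has seen less attention. In this paper, we propose a Gaussian process model to obtain probabilistic extrapolations of accuracy or similar performance metrics as dataset size increases. We evaluate our approach in terms of error, likelihood, and coverage across six datasets. Though we focus on medical tasks and image modalities, our open source approach generalizes to any kind of classifier.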
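As a rough illustration of the single "best-fit" baseline that the abstract contrasts with probabilistic extrapolation, the Python sketch below fits a power law to pilot results and extrapolates to 2x, 10x, and 50x the data. It is not the authors' Gaussian process model: the function names, the assumed error form error(n) = a * n^(-b), the leave-one-out spread, and the pilot numbers are all hypothetical and chosen only for illustration.

```python
import numpy as np


def fit_power_law(sizes, accuracies):
    """Fit error(n) = a * n**(-b) by least squares in log-log space.

    Assumed functional form for illustration; requires accuracies < 1.
    """
    log_n = np.log(np.asarray(sizes, dtype=float))
    log_err = np.log(1.0 - np.asarray(accuracies, dtype=float))
    # For degree 1, polyfit returns [slope, intercept].
    slope, intercept = np.polyfit(log_n, log_err, 1)
    return np.exp(intercept), -slope  # a, b


def predict_accuracy(n, a, b):
    """Point prediction of accuracy at dataset size n under the fitted curve."""
    return 1.0 - a * np.asarray(n, dtype=float) ** (-b)


def leave_one_out_range(sizes, accuracies, n_new):
    """Crude spread: refit with each pilot point dropped, return (min, max).

    This is a stand-in for uncertainty, not a calibrated interval.
    """
    sizes = np.asarray(sizes, dtype=float)
    accuracies = np.asarray(accuracies, dtype=float)
    preds = []
    for i in range(len(sizes)):
        keep = np.arange(len(sizes)) != i
        a, b = fit_power_law(sizes[keep], accuracies[keep])
        preds.append(float(predict_accuracy(n_new, a, b)))
    return min(preds), max(preds)


# Hypothetical pilot results (illustrative numbers, not from the paper).
pilot_sizes = np.array([100.0, 200.0, 400.0, 800.0])
pilot_acc = np.array([0.80, 0.85, 0.90, 0.93])

a_hat, b_hat = fit_power_law(pilot_sizes, pilot_acc)
for factor in (2, 10, 50):
    n_new = factor * pilot_sizes[-1]
    point = float(predict_accuracy(n_new, a_hat, b_hat))
    lo, hi = leave_one_out_range(pilot_sizes, pilot_acc, n_new)
    print(f"{factor}x data: accuracy ~ {point:.3f} (leave-one-out range {lo:.3f} to {hi:.3f})")
```

The leave-one-out range only hints at how sensitive a single fitted curve is to the pilot points. A probabilistic model such as the one the paper proposes would instead give a predictive distribution whose intervals can be checked for coverage.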



