How to use learning curves to evaluate the sample size for malaria prediction models developed using machine learning algorithms.

IF 3.0 · Zone 3 (Medicine) · JCR Q3 (Infectious Diseases)

Malaria Journal Pub Date : 2025-07-24 DOI:10.1186/s12936-025-05479-3

Sophie G Zaloumis, Megha Rajasekhar, Julie A Simpson

Malaria Journal 24(1), 242 (2025). PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12291394/pdf/

Citations: 0

Abstract


How to use learning curves to evaluate the sample size for malaria prediction models developed using machine learning algorithms.


Background: Machine learning algorithms have been used to predict malaria risk and severity, identify immunity biomarkers for malaria vaccine candidates, and determine molecular biomarkers of antimalarial drug resistance. Developing these prediction models requires large training datasets to ensure prediction accuracy when applied to new individuals in the target population. Learning curves can be used to assess the sample size required for the training dataset by evaluating the predictive performance of a model trained using different dataset sizes. These curves are agnostic to the specific prediction model, but their construction does require existing data. This tutorial demonstrates how to generate and interpret learning curves for malaria prediction models developed using machine learning algorithms.
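The idea of a learning curve can be sketched in a few lines: train the same model on progressively larger subsets of the available data and score each fit on one fixed held-out test set. The example below is only an illustration under stated assumptions, not the authors' pipeline. The data are simple simulated Gaussian features, the model is a nearest-centroid classifier standing in for sPLSDA + SVMs or random forests, the error is the plain misclassification rate, and the intermediate training sizes (50, 100, 200, 400) are a hypothetical grid. Only the pool and test sizes (835 and 208) and the smallest size (20) echo numbers reported in the Results.

```python
import random

def simulate(n, p_slow=0.29, n_features=5, shift=0.6, rng=None):
    """Draw n labelled samples; label 1 ("slow") occurs with probability p_slow.
    Features are Gaussian, with class 1 shifted by `shift` in every feature.
    All of these settings are illustrative assumptions."""
    rng = rng or random.Random()
    data = []
    for _ in range(n):
        y = 1 if rng.random() < p_slow else 0
        x = [rng.gauss(shift * y, 1.0) for _ in range(n_features)]
        data.append((x, y))
    return data

def fit_centroids(train):
    """Nearest-centroid 'model': the per-class mean of each feature."""
    centroids = {}
    for label in (0, 1):
        rows = [x for x, y in train if y == label]
        if rows:  # a tiny subset may contain only one class
            centroids[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids

def predict(centroids, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    def dist2(c):
        return sum((a - b) ** 2 for a, b in zip(x, c))
    return min(centroids, key=lambda label: dist2(centroids[label]))

def error_rate(centroids, test):
    """Fraction of test samples that are misclassified."""
    wrong = sum(predict(centroids, x) != y for x, y in test)
    return wrong / len(test)

def learning_curve(train_pool, test, sizes, rng, repeats=20):
    """For each training size, average the test error over random subsets."""
    curve = []
    for n in sizes:
        errs = []
        for _ in range(repeats):
            subset = rng.sample(train_pool, n)
            errs.append(error_rate(fit_centroids(subset), test))
        curve.append((n, sum(errs) / len(errs)))
    return curve

rng = random.Random(0)
pool = simulate(835, rng=rng)   # data available for training
test = simulate(208, rng=rng)   # fixed test set, never used for fitting
curve = learning_curve(pool, test, sizes=[20, 50, 100, 200, 400, 835], rng=rng)
for n, err in curve:
    print(f"n_train={n:4d}  mean test error={err:.3f}")
```

Keeping the test set fixed while only the training subset changes is what makes the points on the curve comparable. Averaging over repeated random subsets smooths out the noise that small training sets would otherwise introduce.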

Methods: To illustrate the approach, training dataset sizes were evaluated to inform the design of a "mock" prediction modelling study that aimed to predict the artemisinin resistance status of Plasmodium falciparum malaria isolates from gene expression data. Data were simulated based on a previously published in vivo parasite gene expression dataset containing the transcriptomes of 1043 P. falciparum isolates from patients with acute malaria, 29% of which (299/1043) came from slow-clearing infections (parasite clearance half-life > 5 h). Learning curves were produced for two machine learning algorithms: sparse Partial Least Squares-Discriminant Analysis plus Support Vector Machines (sPLSDA + SVMs), and random forests. Prediction error was measured using the balanced error rate (the average of the percentage of slow-clearing infections incorrectly predicted as fast and the percentage of fast-clearing infections incorrectly predicted as slow).
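The balanced error rate described above can be written out directly. The sketch below is a minimal illustration: the function name, the 0/1 label coding, and the example counts (60 slow and 148 fast infections, roughly 29% slow in a 208-sample test set) are assumptions made for the example, not values taken from the paper.

```python
def balanced_error_rate(y_true, y_pred, slow=1, fast=0):
    """Average of the two per-class error rates:
    (slow predicted as fast) / (all slow) and (fast predicted as slow) / (all fast)."""
    n_slow = sum(1 for t in y_true if t == slow)
    n_fast = sum(1 for t in y_true if t == fast)
    if n_slow == 0 or n_fast == 0:
        raise ValueError("both classes must be present in y_true")
    slow_as_fast = sum(1 for t, p in zip(y_true, y_pred) if t == slow and p == fast)
    fast_as_slow = sum(1 for t, p in zip(y_true, y_pred) if t == fast and p == slow)
    return 0.5 * (slow_as_fast / n_slow + fast_as_slow / n_fast)

# Hypothetical test set: 60 slow-clearing (1) and 148 fast-clearing (0) infections.
y_true = [1] * 60 + [0] * 148

# A classifier that labels every infection "fast" misses all slow infections.
all_fast = [0] * 208
print(balanced_error_rate(y_true, all_fast))  # 0.5

# Its plain misclassification rate looks much better, which is why the
# balanced version is more informative when classes are imbalanced.
plain_error = sum(t != p for t, p in zip(y_true, all_fast)) / len(y_true)
print(round(plain_error, 3))  # 0.288

# Hypothetical classifier: 12 slow called fast, 37 fast called slow.
y_pred = [0] * 12 + [1] * 48 + [1] * 37 + [0] * 111
ber = balanced_error_rate(y_true, y_pred)
```

A balanced error rate of 50% is therefore what a model that always predicts one class would score, regardless of how imbalanced the classes are.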

Results: For this mock malaria prediction study, the balanced error rate was measured on a test dataset of 208 samples not used for model training. On the smallest training dataset evaluated (20 samples), it was 50% for both sPLSDA + SVMs and random forests; on the largest (835 samples), it was 14% for sPLSDA + SVMs and 22% for random forests. The shape of the learning curves indicates that increasing the training dataset size beyond 835 samples is unlikely to significantly reduce the balanced error rates further.
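One informal way to read a learning curve for a plateau is to look at how much the error drops per additional 100 training samples and to find the point after which every further step improves by less than some tolerance. The sketch below is a heuristic under stated assumptions, not a statistical test and not the authors' procedure. The curve values and the tolerance are entirely hypothetical and are not the paper's results.

```python
def marginal_gains(curve):
    """curve: list of (n_train, error) pairs sorted by n_train.
    Returns (n_end, drop in error per 100 extra training samples) per segment."""
    gains = []
    for (n0, e0), (n1, e1) in zip(curve, curve[1:]):
        gains.append((n1, 100 * (e0 - e1) / (n1 - n0)))
    return gains

def plateau_size(curve, tol=0.01):
    """Smallest training size after which every later segment improves the
    error by less than `tol` per 100 samples; None if the curve is still falling."""
    gains = marginal_gains(curve)
    start = None
    # Walk backwards from the largest size while the segments stay flat.
    for i in range(len(gains) - 1, -1, -1):
        if gains[i][1] < tol:
            start = curve[i][0]
        else:
            break
    return start

# Entirely hypothetical curve, for illustration only.
example_curve = [(20, 0.48), (100, 0.30), (200, 0.22),
                 (400, 0.17), (600, 0.16), (800, 0.155)]

for n_end, gain in marginal_gains(example_curve):
    print(f"up to n={n_end:3d}: error drops {gain:.4f} per 100 samples")
print("plateau starts at n =", plateau_size(example_curve))
```

The tolerance encodes how small an improvement is considered not worth the cost of collecting more samples, so it has to be chosen for each application. A curve whose final segment still exceeds the tolerance returns `None`, signalling that larger training sets may still help.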

Conclusions: Learning curves are a simple tool that can be used to determine the minimum sample size required for future prediction modelling studies of different malaria outcomes that use machine learning algorithms for prediction. These curves need to be generated for each specific prediction modelling application.


Source journal

Malaria Journal (Medicine: Parasitology)

CiteScore: 5.10
Self-citation rate: 23.30%
Articles published: 334
Review time: 2-4 weeks

About the journal: Malaria Journal is aimed at the scientific community interested in malaria in its broadest sense. It is the only journal that publishes exclusively articles on malaria and, as such, it aims to bring together knowledge from the different specialities involved in this very broad discipline, from the bench to the bedside and to the field.