Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach
Yunhui Qi, Xinyi Wang, Li-Xuan Qin
arXiv - STAT - Methodology, 2024-09-10
DOI: arxiv-2409.06180 (https://doi.org/arxiv-2409.06180)
Abstract
Accurate sample classification using transcriptomics data is crucial for
advancing personalized medicine. Achieving this goal necessitates determining a
suitable sample size that ensures adequate statistical power without undue
resource allocation. Current sample size calculation methods rely on
assumptions and algorithms that may not align with supervised machine learning
techniques for sample classification. Addressing this critical methodological
gap, we present a novel computational approach that establishes the
power-versus-sample-size relationship by employing a data augmentation strategy
followed by fitting a learning curve. We comprehensively evaluated its
performance for microRNA and RNA sequencing data, considering diverse data
characteristics and algorithm configurations, based on a spectrum of evaluation
metrics. To foster accessibility and reproducibility, the Python and R code for
implementing our approach is available on GitHub. Its deployment will
significantly facilitate the adoption of machine learning in transcriptomics
studies and accelerate their translation into clinically useful classifiers for
personalized treatment.
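
To make the learning-curve step in the abstract concrete, here is a minimal sketch, not the authors' released code. It assumes an inverse power-law curve, acc(n) = a - b * n^(-c), which is a common choice in the learning-curve literature but is not specified in the abstract. The sample sizes, accuracies, function names, and target accuracy are all hypothetical, and the data augmentation step that would produce accuracy estimates at larger sample sizes is not shown.

```python
# Illustrative sketch only: the inverse power-law form, the function names,
# and every number below are assumptions, not taken from the paper.

def fit_inverse_power_law(sizes, accuracies, c_grid=None):
    """Fit acc(n) = a - b * n**(-c).

    For a fixed c the model is linear in (a, b), so we grid-search c and
    solve for (a, b) with closed-form simple linear regression.
    """
    if c_grid is None:
        c_grid = [k / 100 for k in range(5, 301)]  # c from 0.05 to 3.00
    m = len(sizes)
    y_mean = sum(accuracies) / m
    best = None
    for c in c_grid:
        x = [n ** (-c) for n in sizes]  # regressor for this candidate c
        x_mean = sum(x) / m
        sxx = sum((xi - x_mean) ** 2 for xi in x)
        sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, accuracies))
        slope = sxy / sxx               # slope of acc on n**(-c), equals -b
        a = y_mean - slope * x_mean     # intercept: the asymptotic accuracy
        b = -slope
        sse = sum((a - b * xi - yi) ** 2 for xi, yi in zip(x, accuracies))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def predict(n, a, b, c):
    """Predicted accuracy at sample size n under the fitted curve."""
    return a - b * n ** (-c)


def smallest_size_reaching(target, a, b, c, n_max=100_000):
    """Smallest integer n whose predicted accuracy is >= target, else None."""
    for n in range(1, n_max + 1):
        if predict(n, a, b, c) >= target:
            return n
    return None


# Hypothetical, noise-free accuracies generated from a=0.95, b=1.2, c=0.5.
sizes = [20, 40, 80, 160, 320]
accuracies = [0.95 - 1.2 * n ** -0.5 for n in sizes]

a, b, c = fit_inverse_power_law(sizes, accuracies)
n_needed = smallest_size_reaching(0.88, a, b, c)
print(f"a={a:.3f}, b={b:.3f}, c={c:.2f}, n for 0.88 accuracy: {n_needed}")
```

With real, noisy accuracy estimates, a nonlinear least-squares routine such as `scipy.optimize.curve_fit` would be a natural replacement for the grid search, and the fitted asymptote `a` indicates whether a target accuracy is reachable at any sample size.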