Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach

Yunhui Qi, Xinyi Wang, Li-Xuan Qin
arXiv - STAT - Methodology. Published 2024-09-10. DOI: arxiv-2409.06180 (https://doi.org/arxiv-2409.06180)

Abstract

Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment.
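
The abstract does not state the functional form of the learning curve or the details of the augmentation step, so the following is only a minimal sketch of the general idea of fitting a learning curve and reading a sample size off it. It assumes an inverse power law, acc(n) = a - b * n^(-c), which is a common choice for learning curves but is not confirmed by the source. The function names, the exponent grid, and the synthetic accuracies are all hypothetical and are not taken from the authors' GitHub code.

```python
import math

# Illustrative sketch only. The inverse power law form, the function names,
# and every number below are assumptions, not the authors' implementation.


def fit_inverse_power_law(sizes, accuracies, exponents=None):
    """Fit acc(n) = a - b * n**(-c).

    For each candidate exponent c on a grid, a and b follow from ordinary
    least squares of accuracy on x = n**(-c); the c with the smallest
    squared error is kept.
    """
    if exponents is None:
        exponents = [k / 100 for k in range(1, 301)]  # c from 0.01 to 3.00
    m = len(sizes)
    mean_y = sum(accuracies) / m
    best = None
    for c in exponents:
        x = [n ** (-c) for n in sizes]
        mean_x = sum(x) / m
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, accuracies))
        slope = sxy / sxx  # slope of accuracy on n**(-c), equal to -b
        a = mean_y - slope * mean_x
        b = -slope
        sse = sum((a - b * xi - yi) ** 2 for xi, yi in zip(x, accuracies))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def required_sample_size(a, b, c, target):
    """Smallest n with a - b * n**(-c) >= target, or None if the fitted
    curve never reaches the target (target at or above the plateau a)."""
    if target >= a or b <= 0:
        return None
    return math.ceil((b / (a - target)) ** (1 / c))


# Synthetic accuracies generated from a known curve (a=0.9, b=1.0, c=0.5),
# standing in for accuracies measured on pilot or augmented data.
demo_sizes = [20, 40, 80, 160, 320]
demo_accuracies = [0.9 - n ** -0.5 for n in demo_sizes]

a_hat, b_hat, c_hat = fit_inverse_power_law(demo_sizes, demo_accuracies)
n_needed = required_sample_size(a_hat, b_hat, c_hat, target=0.85)
print(f"a={a_hat:.3f}, b={b_hat:.3f}, c={c_hat:.2f}, n for 0.85 accuracy: {n_needed}")
```

In this sketch the fitted plateau `a` is the accuracy the classifier is projected to approach with unlimited samples, so a target at or above it returns `None` instead of an extrapolated sample size.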