Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach

arXiv - STAT - Methodology | Pub Date: 2024-09-10 | DOI: arxiv-2409.06180

Yunhui Qi, Xinyi Wang, Li-Xuan Qin


Citations: 0


Optimizing Sample Size for Supervised Machine Learning with Bulk Transcriptomic Sequencing: A Learning Curve Approach

Accurate sample classification using transcriptomics data is crucial for advancing personalized medicine. Achieving this goal necessitates determining a suitable sample size that ensures adequate statistical power without undue resource allocation. Current sample size calculation methods rely on assumptions and algorithms that may not align with supervised machine learning techniques for sample classification. Addressing this critical methodological gap, we present a novel computational approach that establishes the power-versus-sample-size relationship by employing a data augmentation strategy followed by fitting a learning curve. We comprehensively evaluated its performance for microRNA and RNA sequencing data, considering diverse data characteristics and algorithm configurations, based on a spectrum of evaluation metrics. To foster accessibility and reproducibility, the Python and R code for implementing our approach is available on GitHub. Its deployment will significantly facilitate the adoption of machine learning in transcriptomics studies and accelerate their translation into clinically useful classifiers for personalized treatment.
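As a rough illustration of the "fit a learning curve, then read off a sample size" idea described in the abstract, the sketch below fits an inverse power law, score(n) = a - b * n^(-c), to classifier scores observed at several training-set sizes and then inverts the fitted curve to find the smallest n that reaches a target score. The inverse power law form, the grid-search fitting procedure, the function names, and the synthetic scores are all assumptions made for this example; the abstract does not specify the curve family, the augmentation strategy, or the evaluation metric, so this should not be read as the authors' implementation.

```python
import math


def fit_inverse_power_law(sizes, scores, exponents=None):
    """Fit score(n) ~ a - b * n**(-c) (an assumed curve family).

    For each candidate exponent c, a and b follow from ordinary least
    squares on x = n**(-c); the c with the smallest squared error wins.
    """
    if exponents is None:
        # Hypothetical search grid: c from 0.05 to 3.00 in steps of 0.01.
        exponents = [k / 100 for k in range(5, 301)]
    m = len(sizes)
    mean_y = sum(scores) / m
    best = None
    for c in exponents:
        x = [n ** (-c) for n in sizes]
        mean_x = sum(x) / m
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, scores))
        slope = sxy / sxx          # slope of score against n**(-c), equals -b
        a = mean_y - slope * mean_x
        b = -slope
        sse = sum((a - b * xi - yi) ** 2 for xi, yi in zip(x, scores))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def required_sample_size(a, b, c, target):
    """Smallest integer n with a - b * n**(-c) >= target.

    Returns None when the target is at or above the fitted plateau a,
    because the curve never reaches it.
    """
    if target >= a:
        return None
    if b <= 0:
        return 1  # fitted curve does not increase with n; any size qualifies
    return max(1, math.ceil((b / (a - target)) ** (1.0 / c)))


# Synthetic scores (not real data) generated from a=0.9, b=0.5, c=0.5.
sizes = [20, 40, 80, 160, 320]
scores = [0.9 - 0.5 * n ** -0.5 for n in sizes]

a, b, c = fit_inverse_power_law(sizes, scores)
n_needed = required_sample_size(a, b, c, target=0.86)
print(f"plateau={a:.3f}, rate={b:.3f}, exponent={c:.2f}, n_needed={n_needed}")
```

In practice the scores at each size would come from repeated train/test evaluations (on real or augmented samples) rather than from a formula, and a nonlinear least-squares routine could replace the grid search over the exponent.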

