应对极小样本量蛋白质组学数据分析中的统计挑战:一项模拟研究。

IF 3.5 2区 生物学 Q2 BIOTECHNOLOGY & APPLIED MICROBIOLOGY
Kyung Hyun Lee, Shervin Assassi, Chandra Mohan, Claudia Pedroza
{"title":"应对极小样本量蛋白质组学数据分析中的统计挑战:一项模拟研究。","authors":"Kyung Hyun Lee, Shervin Assassi, Chandra Mohan, Claudia Pedroza","doi":"10.1186/s12864-024-11018-2","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>One of the most promising approaches for early and more precise disease prediction and diagnosis is through the inclusion of proteomics data augmented with clinical data. Clinical proteomics data is often characterized by its high dimensionality and extremely limited sample size, posing a significant challenge when employing machine learning techniques for extracting only the most relevant information. Although there is a wide array of statistical techniques and numerous analysis pipelines employed in proteomics data analysis, it is unclear which of these methods produce the most efficient, reproducible, and clinically meaningful results.</p><p><strong>Results: </strong>In this study, we compared 9 unique analysis schemes comprised of different machine learning and dimensionality reduction methods for the analysis of simulated proteomics data consisting of 1317 proteins measured in 26 subjects (i.e., 13 controls and 13 cases). In scenarios where the sample size is extremely small (i.e., n < 30), all schemes resulted in an exceptionally high level of performance metrics, indicating potential overfitting. While performance metrics did not exhibit significant differences across schemes, the set of proteins selected to be discriminatory between groups demonstrated a substantial level of heterogeneity. However, despite heterogeneity in the selected proteins, their biological pathways and genetic diseases exhibited similarities. A sensitivity analysis conducted using varying sample sizes indicated that the stability of a set of selected biomarkers improves with larger sample sizes within a scheme.</p><p><strong>Conclusions: </strong>When the aim of the study is to identify a statistical model that best distinguishes between cohort groups using proteomics data and to uncover the biological pathways and disorders common among the selected proteins, the majority of widely used analysis pipelines perform similarly. However, if the main objective is to pinpoint a set of selected proteins that wield significant influence in discriminating cohort groups and utilize them for subsequent investigations, meticulous consideration is necessary when opting for statistical models, due to the possibility of heterogeneity in the sets of selected proteins.</p>","PeriodicalId":9030,"journal":{"name":"BMC Genomics","volume":"25 1","pages":"1086"},"PeriodicalIF":3.5000,"publicationDate":"2024-11-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11566501/pdf/","citationCount":"0","resultStr":"{\"title\":\"Addressing statistical challenges in the analysis of proteomics data with extremely small sample size: a simulation study.\",\"authors\":\"Kyung Hyun Lee, Shervin Assassi, Chandra Mohan, Claudia Pedroza\",\"doi\":\"10.1186/s12864-024-11018-2\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>One of the most promising approaches for early and more precise disease prediction and diagnosis is through the inclusion of proteomics data augmented with clinical data. Clinical proteomics data is often characterized by its high dimensionality and extremely limited sample size, posing a significant challenge when employing machine learning techniques for extracting only the most relevant information. Although there is a wide array of statistical techniques and numerous analysis pipelines employed in proteomics data analysis, it is unclear which of these methods produce the most efficient, reproducible, and clinically meaningful results.</p><p><strong>Results: </strong>In this study, we compared 9 unique analysis schemes comprised of different machine learning and dimensionality reduction methods for the analysis of simulated proteomics data consisting of 1317 proteins measured in 26 subjects (i.e., 13 controls and 13 cases). In scenarios where the sample size is extremely small (i.e., n < 30), all schemes resulted in an exceptionally high level of performance metrics, indicating potential overfitting. While performance metrics did not exhibit significant differences across schemes, the set of proteins selected to be discriminatory between groups demonstrated a substantial level of heterogeneity. However, despite heterogeneity in the selected proteins, their biological pathways and genetic diseases exhibited similarities. A sensitivity analysis conducted using varying sample sizes indicated that the stability of a set of selected biomarkers improves with larger sample sizes within a scheme.</p><p><strong>Conclusions: </strong>When the aim of the study is to identify a statistical model that best distinguishes between cohort groups using proteomics data and to uncover the biological pathways and disorders common among the selected proteins, the majority of widely used analysis pipelines perform similarly. However, if the main objective is to pinpoint a set of selected proteins that wield significant influence in discriminating cohort groups and utilize them for subsequent investigations, meticulous consideration is necessary when opting for statistical models, due to the possibility of heterogeneity in the sets of selected proteins.</p>\",\"PeriodicalId\":9030,\"journal\":{\"name\":\"BMC Genomics\",\"volume\":\"25 1\",\"pages\":\"1086\"},\"PeriodicalIF\":3.5000,\"publicationDate\":\"2024-11-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11566501/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"BMC Genomics\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1186/s12864-024-11018-2\",\"RegionNum\":2,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"BIOTECHNOLOGY & APPLIED MICROBIOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMC Genomics","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1186/s12864-024-11018-2","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"BIOTECHNOLOGY & APPLIED MICROBIOLOGY","Score":null,"Total":0}
引用次数: 0

摘要

背景:将蛋白质组学数据与临床数据相结合,是进行早期和更精确疾病预测与诊断的最有前途的方法之一。临床蛋白质组学数据通常具有高维度和样本量极其有限的特点,这给采用机器学习技术提取最相关信息带来了巨大挑战。虽然在蛋白质组学数据分析中使用了大量的统计技术和分析管道,但目前还不清楚哪种方法能产生最有效、可重复和有临床意义的结果:在这项研究中,我们比较了 9 种独特的分析方案,这些方案由不同的机器学习和降维方法组成,用于分析模拟蛋白质组学数据,这些数据包括在 26 个受试者(即 13 个对照组和 13 个病例)中测量的 1317 个蛋白质。在样本量极小的情况下(即 n 个结论:当研究的目的是利用蛋白质组学数据找出最能区分同组人群的统计模型,并发现所选蛋白质中常见的生物通路和疾病时,大多数广泛使用的分析管道都有类似的表现。然而,如果主要目的是找出一组在区分队列组别方面具有显著影响的选定蛋白质,并将其用于后续研究,那么在选择统计模型时就必须进行缜密的考虑,因为选定蛋白质组中可能存在异质性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Addressing statistical challenges in the analysis of proteomics data with extremely small sample size: a simulation study.

Background: One of the most promising approaches for early and more precise disease prediction and diagnosis is through the inclusion of proteomics data augmented with clinical data. Clinical proteomics data is often characterized by its high dimensionality and extremely limited sample size, posing a significant challenge when employing machine learning techniques for extracting only the most relevant information. Although there is a wide array of statistical techniques and numerous analysis pipelines employed in proteomics data analysis, it is unclear which of these methods produce the most efficient, reproducible, and clinically meaningful results.

Results: In this study, we compared 9 unique analysis schemes comprised of different machine learning and dimensionality reduction methods for the analysis of simulated proteomics data consisting of 1317 proteins measured in 26 subjects (i.e., 13 controls and 13 cases). In scenarios where the sample size is extremely small (i.e., n < 30), all schemes resulted in an exceptionally high level of performance metrics, indicating potential overfitting. While performance metrics did not exhibit significant differences across schemes, the set of proteins selected to be discriminatory between groups demonstrated a substantial level of heterogeneity. However, despite heterogeneity in the selected proteins, their biological pathways and genetic diseases exhibited similarities. A sensitivity analysis conducted using varying sample sizes indicated that the stability of a set of selected biomarkers improves with larger sample sizes within a scheme.

Conclusions: When the aim of the study is to identify a statistical model that best distinguishes between cohort groups using proteomics data and to uncover the biological pathways and disorders common among the selected proteins, the majority of widely used analysis pipelines perform similarly. However, if the main objective is to pinpoint a set of selected proteins that wield significant influence in discriminating cohort groups and utilize them for subsequent investigations, meticulous consideration is necessary when opting for statistical models, due to the possibility of heterogeneity in the sets of selected proteins.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
BMC Genomics
BMC Genomics 生物-生物工程与应用微生物
CiteScore
7.40
自引率
4.50%
发文量
769
审稿时长
6.4 months
期刊介绍: BMC Genomics is an open access, peer-reviewed journal that considers articles on all aspects of genome-scale analysis, functional genomics, and proteomics. BMC Genomics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信