Predictive Modeling of Novel Somatic Mutation Impacts on Cancer Prognosis: A Machine Learning Approach Using the COSMIC Database

Masab A. Mansoor, Dba
medRxiv, published 2024-08-11. DOI: 10.1101/2024.08.10.24311796 (https://doi.org/10.1101/2024.08.10.24311796)

Abstract

Background: Somatic mutations play a crucial role in cancer initiation, progression, and treatment response. While high-throughput sequencing has vastly expanded our understanding of cancer genomics, interpreting the functional impact of novel somatic mutations remains challenging. Machine learning approaches show promise in predicting mutation impacts, but robust models for accurate prognosis across different cancer types are still needed.

Objective: This study aimed to develop and validate a machine learning model using the Catalogue of Somatic Mutations in Cancer (COSMIC) database to predict the functional impact of novel somatic mutations on cancer prognosis across various cancer types.

Methods: We extracted data on 6,573,214 coding point mutations across 1,391 cancer types from COSMIC v95. We engineered 47 features for each mutation, including sequence context, protein domain information, evolutionary conservation scores, and frequency data. We developed and compared Random Forest, XGBoost, and Deep Neural Network models, selecting XGBoost based on performance. The model was evaluated using standard metrics and externally validated using data from The Cancer Genome Atlas (TCGA).

Results: The XGBoost model achieved an area under the Receiver Operating Characteristic curve (AUC-ROC) of 0.89 on the test set and 0.86 on the TCGA validation set. The model demonstrated consistent performance across major cancer types (AUC-ROC range: 0.85-0.92). Key predictive features included evolutionary conservation score, protein domain disruption, and mutation frequency. The model correctly identified 87% of known driver mutations and predicted 3,241 potentially high-impact novel mutations. Model predictions significantly correlated with patient survival in the TCGA dataset (HR = 1.8, 95% CI: 1.6-2.0, p < 0.001).

Conclusions: Our machine learning model shows strong predictive power in assessing the functional impact of somatic mutations on cancer prognosis across various cancer types. This approach has potential applications in research prioritization and clinical decision support, contributing to the advancement of precision oncology.

Keywords: cancer genomics; somatic mutations; machine learning; prognosis prediction; COSMIC database; precision oncology
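
The headline metric in the Results is the AUC-ROC. As a reading aid, the minimal sketch below shows one standard way to compute that quantity from labels and model scores, using the rank-sum identity: the AUC equals the probability that a randomly chosen positive example scores higher than a randomly chosen negative one. The function name auc_roc and the toy labels and scores are hypothetical illustrations; they are not the authors' code and not data from COSMIC or TCGA.

```python
from typing import Sequence


def auc_roc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank-sum (Mann-Whitney U) identity.

    Counts, over all positive/negative pairs, how often the positive
    example receives the higher score; ties count as one half.
    """
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC-ROC needs at least one positive and one negative label")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Toy data (hypothetical, for illustration only): 1 = labelled high-impact.
toy_labels = [1, 1, 1, 0, 0, 0]
toy_scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
toy_auc = auc_roc(toy_labels, toy_scores)
print(f"toy AUC-ROC = {toy_auc:.3f}")  # 8 of 9 pairs ranked correctly -> 0.889
```

The pairwise loop is quadratic in the number of examples, which is fine for a small illustration; at the scale of millions of mutations a sort-based implementation from an established library would be the practical choice.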