Classification of Sarcoma Based on Genomic Data Using Machine Learning Models

Procedia Computer Science. Pub Date: 2025-01-01. DOI: 10.1016/j.procs.2024.12.034

Pratham Gala, Yash Pandloskar, Shubham Godbole, Fayed Hakim, Pratik Kanani, Lakshmi Kurup

Volume 252, Pages 317-330. Source: https://www.sciencedirect.com/science/article/pii/S1877050924034665


Abstract

This work proposes a machine-learning approach for classifying the types of soft tissue sarcoma from genomic data, addressing a considerable gap in sarcoma diagnostics. Previous studies have investigated various aspects of sarcoma, but according to the authors this is the first study to predict sarcoma variant types from genetic information. The study uses a stacking ensemble comprising Random Forest, Extreme Gradient Boosting, and LightGBM, with Random Forest as the meta-estimator. The model was trained and validated on a dataset of 206 adult soft tissue sarcoma samples containing genomic alteration, transcriptomic, epigenomic, and proteomic data, and achieved an accuracy of 89.44% with a precision as high as 91%. Stratified k-fold cross-validation is employed so that class imbalance does not hinder performance. The authors report that the ensemble substantially outperforms single classifiers and traditional single-model methods, showing that machine learning on genomic data is a feasible and effective way to predict sarcoma variants. They suggest that the findings could substantially improve cancer diagnosis: they promise more accurate classification and personalized treatment, and they provide a framework for analogous applications in other rare, complex cancers.
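The stratified k-fold validation named in the abstract can be illustrated with a minimal sketch. This is not the authors' code: the class names, the class sizes, and k = 5 are hypothetical (the abstract states only the total of 206 samples), and `stratified_k_fold` is an illustrative helper written with the Python standard library alone. It shows the one property that matters here: every fold keeps roughly the same class proportions as the full dataset, so a rare class is never missing from a test fold.

```python
from collections import Counter, defaultdict
import random


def stratified_k_fold(labels, k, seed=0):
    """Yield (train_idx, test_idx) pairs whose class mix mirrors `labels`."""
    rng = random.Random(seed)

    # Group sample indices by class label.
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(k)]
    # Deal each class's shuffled indices round-robin across the folds,
    # so every fold receives a near-equal share of every class.
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        for position, idx in enumerate(indices):
            folds[position % k].append(idx)

    # Each fold serves once as the test set; the rest form the training set.
    for i in range(k):
        test_idx = sorted(folds[i])
        train_idx = sorted(idx for j in range(k) if j != i for idx in folds[j])
        yield train_idx, test_idx


# Hypothetical, imbalanced labels: 206 samples in three made-up classes.
labels = ["A"] * 120 + ["B"] * 60 + ["C"] * 26
splits = list(stratified_k_fold(labels, k=5))
fold_counts = [Counter(labels[i] for i in test) for _, test in splits]
for fold, counts in enumerate(fold_counts):
    print(fold, dict(sorted(counts.items())))
```

In practice a library routine would normally replace the helper, for example scikit-learn's `StratifiedKFold` wrapped around a `StackingClassifier`; the sketch only makes the splitting rule explicit.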
