Performance-weighted-voting model: An ensemble machine learning method for cancer type classification using whole-exome sequencing mutation.

Impact factor: 0.6 · Zone 4 (Biology) · JCR Q4 · Mathematical & Computational Biology
Quantitative Biology · Pub Date: 2020-12-24 · Epub Date: 2020-12-07 · DOI: 10.1007/s40484-020-0226-1
Yawei Li, Yuan Luo
Quantitative Biology, volume 8, issue 4, pages 347-358. Publication type: Journal Article.
Citations: 12

Abstract

Background: Improvements in next-generation DNA sequencing technology have lowered the cost of collecting genetic data, so more machine learning techniques can be applied to cancer analysis and diagnosis.

Methods: We developed an ensemble machine learning system, the performance-weighted-voting model, for cancer type classification in 6,249 samples across 14 cancer types. The ensemble consists of five weak classifiers (logistic regression, SVM, random forest, XGBoost and neural networks). We first used cross-validation to obtain the predictions of the five classifiers. The weight of each weak classifier is then obtained from its predictive performance by solving linear regression functions. The final predicted probability of the performance-weighted-voting model for a cancer type is the sum, over the classifiers, of each classifier's weight multiplied by its predicted probability.
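The weighting step described above can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract does not give the exact regression formulation, so the sketch assumes an ordinary least-squares fit of the one-hot labels on the classifiers' out-of-fold probabilities. The function names `fit_voting_weights` and `weighted_vote`, the synthetic `fake_classifier` helper, and the toy sizes are all hypothetical.

```python
import numpy as np

def fit_voting_weights(cv_probas, y_true, n_classes):
    """Least-squares weights, one per classifier (an assumed formulation).

    cv_probas: list of arrays, each (n_samples, n_classes), holding the
               out-of-fold predicted probabilities of one classifier.
    y_true:    integer class labels, shape (n_samples,).
    """
    # One column per classifier; rows enumerate every (sample, class) pair.
    design = np.column_stack([p.ravel() for p in cv_probas])
    # One-hot labels flattened in the same (sample, class) order.
    target = np.eye(n_classes)[y_true].ravel()
    weights, *_ = np.linalg.lstsq(design, target, rcond=None)
    return weights

def weighted_vote(probas, weights):
    """Sum of weight * predicted probability, then argmax per sample."""
    combined = sum(w * p for w, p in zip(weights, probas))
    return combined, combined.argmax(axis=1)

# Toy demonstration with synthetic "out-of-fold" probabilities.
rng = np.random.default_rng(0)
n_samples, n_classes = 300, 4
y = rng.integers(0, n_classes, size=n_samples)

def fake_classifier(signal):
    """Noisy softmax scores; a larger `signal` mimics a stronger classifier."""
    logits = signal * np.eye(n_classes)[y] + rng.normal(size=(n_samples, n_classes))
    exp = np.exp(logits - logits.max(axis=1, keepdims=True))
    return exp / exp.sum(axis=1, keepdims=True)

cv_probas = [fake_classifier(s) for s in (0.5, 1.5, 3.0)]
weights = fit_voting_weights(cv_probas, y, n_classes)
combined, predicted = weighted_vote(cv_probas, weights)
print("weights:", np.round(weights, 3))
print("ensemble accuracy:", (predicted == y).mean())
```

In this toy run the weights are fitted and evaluated on the same synthetic data. In a real pipeline the weights would be fitted on cross-validated predictions from the training data and then applied to the classifiers' probabilities on held-out samples.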

Results: Using the somatic mutation count of each gene as the input feature, the overall accuracy of the performance-weighted-voting model reached 71.46%, which was significantly higher than that of the five weak classifiers and of two other ensemble models: the hard-voting model and the soft-voting model. In addition, by analyzing the predictive pattern of the performance-weighted-voting model, we found that in most cancer types, higher tumor mutational burden can improve overall accuracy.
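For readers unfamiliar with the two baseline ensembles named above, the following sketch shows the standard definitions of hard and soft voting. It is an illustration under stated assumptions, not code from the paper: the function names `hard_vote` and `soft_vote` and the toy probabilities are hypothetical, and the tie-breaking rule in `hard_vote` is an arbitrary choice.

```python
import numpy as np

def soft_vote(probas):
    """Average the predicted probabilities, then take the argmax."""
    return np.mean(probas, axis=0).argmax(axis=1)

def hard_vote(probas):
    """Each classifier casts one vote for its own top class; majority wins.

    Ties go to the lowest class index (an arbitrary choice in this sketch).
    """
    n_classes = probas[0].shape[1]
    # Turn each classifier's argmax into a one-hot vote and add the votes up.
    counts = sum(np.eye(n_classes)[p.argmax(axis=1)] for p in probas)
    return counts.argmax(axis=1)

# Three toy classifiers, two samples, two classes.
clf_a = np.array([[0.9, 0.1], [0.2, 0.8]])
clf_b = np.array([[0.4, 0.6], [0.2, 0.8]])
clf_c = np.array([[0.4, 0.6], [0.2, 0.8]])
probas = [clf_a, clf_b, clf_c]

hard = hard_vote(probas)  # sample 0: two of three classifiers pick class 1
soft = soft_vote(probas)  # sample 0: mean probability favours class 0
print("hard voting:", hard)
print("soft voting:", soft)
```

On the first toy sample the two schemes disagree: one confident classifier outweighs two lukewarm ones under soft voting but not under hard voting. Both schemes treat every classifier equally, which is the property a performance-weighted scheme relaxes.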

Conclusion: This study has important clinical significance for identifying the origin of cancer, especially for cancers whose primary site cannot be determined. In addition, our model presents a good strategy for using ensemble systems for cancer type classification.

Source journal
Quantitative Biology (Mathematical & Computational Biology)
CiteScore: 5.00
Self-citation rate: 3.20%
Articles published: 264
About the journal: Quantitative Biology is an interdisciplinary journal that focuses on original research that uses quantitative approaches and technologies to analyze and integrate biological systems, construct and model engineered life systems, and gain a deeper understanding of the life sciences. It aims to provide a platform for not only the analysis but also the integration and construction of biological systems. It is a quarterly journal seeking to provide an inter- and multi-disciplinary forum for a broad blend of peer-reviewed academic papers in order to promote rapid communication and exchange between scientists in the East and the West. The content of Quantitative Biology mainly focuses on two broad and related areas:
· Bioinformatics and computational biology, which deals with information technologies and computational methodologies that can efficiently and accurately manipulate -omics data and transform molecular information into biological knowledge.
· Systems and synthetic biology, which focuses on complex interactions in biological systems and their emergent functional properties, and on the design and construction of new biological functions and systems.
Its goal is to reflect the significant advances made in quantitatively investigating and modeling both natural and engineered life systems at the molecular and higher levels. The journal particularly encourages original papers that link novel theory with cutting-edge experiments, especially in newly emerging and multi-disciplinary areas of research. The journal also welcomes high-quality reviews and perspective articles.