Performance-weighted-voting model: An ensemble machine learning method for cancer type classification using whole-exome sequencing mutation.

IF 1.4 · Zone 4, Biology · Q4 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Quantitative Biology Pub Date : 2020-12-24 Epub Date: 2020-12-07 DOI:10.1007/s40484-020-0226-1

Yawei Li, Yuan Luo

Quantitative Biology, volume 8, issue 4, pages 347-358. Journal Article.

Citations: 12

Abstract

Background: With improvements in next-generation DNA sequencing technology, genetic data can be collected at lower cost. As a result, more machine learning techniques can be applied to cancer analysis and diagnosis.

Methods: We developed an ensemble machine learning system, named the performance-weighted-voting model, for cancer type classification in 6,249 samples across 14 cancer types. Our ensemble system consists of five weak classifiers (logistic regression, SVM, random forest, XGBoost and neural networks). We first used cross-validation to get the predicted results for the five classifiers. The weights of the five weak classifiers are then obtained from their predictive performance by solving linear regression functions. The final predicted probability of the performance-weighted-voting model for a cancer type is the sum, over the classifiers, of each classifier's weight multiplied by its predicted probability.
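The weighting step can be illustrated with a minimal sketch. The abstract does not give the exact regression formulation, so the version below is an assumption: it fits one weight per classifier by ordinary least squares, regressing one-hot true labels on the classifiers' out-of-fold predicted probabilities. The function names `fit_performance_weights` and `weighted_vote` are hypothetical, and the published model may constrain or normalize the weights differently.

```python
import numpy as np


def fit_performance_weights(cv_probs, y_true, n_classes):
    """Fit one weight per classifier by least squares (illustrative assumption).

    cv_probs: array of shape (n_classifiers, n_samples, n_classes) holding
              cross-validated (out-of-fold) predicted probabilities.
    y_true:   integer class labels of shape (n_samples,).
    """
    n_clf = cv_probs.shape[0]
    onehot = np.eye(n_classes)[y_true]        # (n_samples, n_classes)
    # Every (sample, class) pair is one regression row; columns are classifiers.
    X = cv_probs.reshape(n_clf, -1).T         # (n_samples * n_classes, n_clf)
    target = onehot.reshape(-1)               # same (sample, class) ordering as X
    weights, *_ = np.linalg.lstsq(X, target, rcond=None)
    return weights


def weighted_vote(probs, weights):
    """Combine classifier probabilities as sum_k weights[k] * probs[k]."""
    combined = np.tensordot(weights, probs, axes=1)   # (n_samples, n_classes)
    return combined.argmax(axis=1), combined
```

With this formulation, a classifier whose out-of-fold probabilities track the true labels closely receives a larger weight than one whose probabilities carry little information, which is the behavior the abstract describes.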

Results: Using the somatic mutation count of each gene as the input feature, the overall accuracy of the performance-weighted-voting model reached 71.46%, which was significantly higher than that of each of the five weak classifiers and of two other ensemble models: the hard-voting model and the soft-voting model. In addition, by analyzing the predictive pattern of the performance-weighted-voting model, we found that in most cancer types, higher tumor mutational burden can improve overall accuracy.
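For context, the two baseline ensembles named above can be sketched as follows. This is the generic textbook definition of hard and soft voting, not code from the paper; the function names and the tie-breaking rule (lowest class index wins) are assumptions made for illustration.

```python
import numpy as np


def soft_vote(probs):
    """Average the predicted probabilities over classifiers, then pick the top class.

    probs: array of shape (n_classifiers, n_samples, n_classes).
    """
    return probs.mean(axis=0).argmax(axis=1)


def hard_vote(probs):
    """Each classifier votes for its most probable class; the majority wins.

    Ties are broken in favor of the lowest class index (an assumption).
    """
    n_classes = probs.shape[2]
    votes = probs.argmax(axis=2)                          # (n_classifiers, n_samples)
    # Count, for every sample, how many classifiers voted for each class.
    counts = (votes[:, :, None] == np.arange(n_classes)).sum(axis=0)
    return counts.argmax(axis=1)                          # (n_samples,)
```

Both baselines treat all classifiers equally. The two can disagree when one confident classifier is outnumbered by several less confident ones: hard voting follows the majority, while soft voting follows the averaged probabilities.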

Conclusion: This study has important clinical significance for identifying the origin of cancer, especially for cancers whose primary site cannot be determined. In addition, our model presents a good strategy for using ensemble systems for cancer type classification.


Source journal

Quantitative Biology (MATHEMATICAL & COMPUTATIONAL BIOLOGY)

CiteScore: 5.00

Self-citation rate: 3.20%

Publications: 264

About the journal: Quantitative Biology is an interdisciplinary journal that publishes original research using quantitative approaches and technologies to analyze and integrate biological systems, construct and model engineered life systems, and gain a deeper understanding of the life sciences. It aims to provide a platform not only for the analysis but also for the integration and construction of biological systems. It is a quarterly journal that offers an inter- and multi-disciplinary forum for peer-reviewed academic papers, with the aim of promoting rapid communication and exchange between scientists in the East and the West. Its content focuses mainly on two broad and related areas:

· Bioinformatics and computational biology, which deals with information technologies and computational methodologies that can efficiently and accurately manipulate -omics data and transform molecular information into biological knowledge.

· Systems and synthetic biology, which focuses on complex interactions in biological systems and their emergent functional properties, and on the design and construction of new biological functions and systems.

Its goal is to reflect the significant advances made in quantitatively investigating and modeling both natural and engineered life systems at the molecular and higher levels. The journal particularly encourages original papers that link novel theory with cutting-edge experiments, especially in newly emerging and multi-disciplinary areas of research. It also welcomes high-quality reviews and perspective articles.