Performance-weighted-voting model: An ensemble machine learning method for cancer type classification using whole-exome sequencing mutation.
Authors: Yawei Li, Yuan Luo
Quantitative Biology 8(4): 347-358. Published 2020-12-24 (Epub 2020-12-07). DOI: 10.1007/s40484-020-0226-1 (https://doi.org/10.1007/s40484-020-0226-1)
Background: With improvements in next-generation DNA sequencing technology, genetic data can be collected at lower cost. As a result, more machine learning techniques can be used to help with cancer analysis and diagnosis.
Methods: We developed an ensemble machine learning system, named the performance-weighted-voting model, for cancer type classification in 6,249 samples across 14 cancer types. Our ensemble system consists of five weak classifiers (logistic regression, SVM, random forest, XGBoost and neural networks). We first used cross-validation to obtain the predicted results of the five classifiers. The weights of the five weak classifiers are then obtained from their predictive performance by solving linear regression functions. The final predicted probability of the performance-weighted-voting model for a cancer type is the sum, over the classifiers, of each classifier's weight multiplied by its predicted probability.
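The weighting and combination steps described in the Methods can be illustrated with a minimal sketch. The abstract does not give the exact regression formulation, so the version below is an assumption: it fits one weight per classifier by ordinary least squares against one-hot labels, using cross-validated (out-of-fold) probabilities as inputs. The function names, the toy probabilities, and the two-classifier setup are hypothetical, and unconstrained least squares may yield negative weights, which the published model may or may not allow.

```python
from typing import List

Matrix = List[List[float]]  # rows = samples, columns = cancer types


def solve_linear_system(a: Matrix, b: List[float]) -> List[float]:
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def fit_weights(probs: List[Matrix], labels: List[int]) -> List[float]:
    """Assumed formulation: least-squares weights w such that
    sum_k w[k] * probs[k] approximates the one-hot encoded labels."""
    k = len(probs)
    n_classes = len(probs[0][0])
    ata = [[0.0] * k for _ in range(k)]  # normal-equation matrix
    atb = [0.0] * k                      # normal-equation right-hand side
    for s, label in enumerate(labels):
        for c in range(n_classes):
            target = 1.0 if c == label else 0.0
            column = [probs[j][s][c] for j in range(k)]
            for i in range(k):
                atb[i] += column[i] * target
                for j in range(k):
                    ata[i][j] += column[i] * column[j]
    return solve_linear_system(ata, atb)


def weighted_vote(probs: List[Matrix], weights: List[float]) -> List[int]:
    """Pick, per sample, the class with the largest weighted sum of probabilities."""
    predictions = []
    for s in range(len(probs[0])):
        n_classes = len(probs[0][s])
        scores = [
            sum(w * p[s][c] for w, p in zip(weights, probs))
            for c in range(n_classes)
        ]
        predictions.append(max(range(n_classes), key=scores.__getitem__))
    return predictions


# Toy out-of-fold probabilities for two hypothetical classifiers, three classes.
labels = [0, 1, 2, 1]
clf_a = [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.2, 0.2, 0.6], [0.3, 0.5, 0.2]]
clf_b = [[0.4, 0.3, 0.3], [0.5, 0.3, 0.2], [0.3, 0.4, 0.3], [0.2, 0.4, 0.4]]

weights = fit_weights([clf_a, clf_b], labels)
predictions = weighted_vote([clf_a, clf_b], weights)
print(weights, predictions)
```

In this toy data the first classifier is the more accurate one, so it receives the larger weight. In practice the weights would be fitted on cross-validated predictions from the training data and then applied to the probabilities each classifier produces on held-out samples.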
Results: Using the somatic mutation count of each gene as the input feature, the overall accuracy of the performance-weighted-voting model reached 71.46%, which was significantly higher than that of each of the five weak classifiers and of two other ensemble models: the hard-voting model and the soft-voting model. In addition, by analyzing the predictive pattern of the performance-weighted-voting model, we found that in most cancer types, higher tumor mutational burden can improve overall accuracy.
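For readers unfamiliar with the two baseline ensembles named in the Results, the following sketch contrasts hard voting (majority of predicted labels) with soft voting (argmax of the averaged probabilities) for a single sample. It reflects the standard definitions of these schemes, not necessarily the authors' exact implementation; the function names, the tie-breaking rule (lowest class index wins), and the toy probabilities are assumptions made for illustration.

```python
from collections import Counter
from typing import List, Sequence


def argmax(values: Sequence[float]) -> int:
    """Index of the largest value; the first index wins on ties."""
    return max(range(len(values)), key=values.__getitem__)


def hard_vote(sample_probs: List[List[float]]) -> int:
    """Each classifier votes for its most probable class; the majority wins.
    Ties are broken in favour of the lowest class index (an assumption)."""
    votes = Counter(argmax(p) for p in sample_probs)
    top = max(votes.values())
    return min(cls for cls, count in votes.items() if count == top)


def soft_vote(sample_probs: List[List[float]]) -> int:
    """Average the predicted probabilities across classifiers, then take the argmax."""
    n_classes = len(sample_probs[0])
    mean = [
        sum(p[c] for p in sample_probs) / len(sample_probs)
        for c in range(n_classes)
    ]
    return argmax(mean)


# One sample, three hypothetical classifiers, two classes.
# Two classifiers weakly favour class 0; one strongly favours class 1.
sample = [[0.55, 0.45], [0.60, 0.40], [0.05, 0.95]]

hard_result = hard_vote(sample)  # majority of labels
soft_result = soft_vote(sample)  # averaged probabilities
print(hard_result, soft_result)
```

The two schemes disagree on this sample because hard voting ignores how confident each classifier is, whereas soft voting lets one confident classifier outweigh two uncertain ones. Both treat all classifiers equally, which is the point on which a performance-weighted scheme differs.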
Conclusion: This study has important clinical significance for identifying the origin of cancer, especially for cancers whose primary site cannot be determined. In addition, our model presents a good strategy for using ensemble systems for cancer type classification.
About the journal:
Quantitative Biology is an interdisciplinary journal that publishes original research using quantitative approaches and technologies to analyze and integrate biological systems, construct and model engineered life systems, and gain a deeper understanding of the life sciences. It aims to provide a platform not only for the analysis but also for the integration and construction of biological systems. It is a quarterly journal that seeks to provide an inter- and multi-disciplinary forum for a broad blend of peer-reviewed academic papers, in order to promote rapid communication and exchange between scientists in the East and the West. The content of Quantitative Biology focuses mainly on two broad and related areas:
· Bioinformatics and computational biology, which deals with information technologies and computational methodologies that can efficiently and accurately manipulate -omics data and transform molecular information into biological knowledge.
· Systems and synthetic biology, which focuses on complex interactions in biological systems and their emergent functional properties, and on the design and construction of new biological functions and systems.
Its goal is to reflect the significant advances made in quantitatively investigating and modeling both natural and engineered life systems at the molecular and higher levels. The journal particularly encourages original papers that link novel theory with cutting-edge experiments, especially in newly emerging and multi-disciplinary areas of research. The journal also welcomes high-quality reviews and perspective articles.