A novel three-step transcriptomic framework for cancer prediction
Rushank Goyal
Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 2022-08-07
https://doi.org/10.1145/3535508.3545098
Cancer is a broad term for diseases characterized by uncontrolled, abnormal cell growth. With 19.3 million new cases and 10 million cancer-related deaths per annum, it is the second-leading cause of death worldwide [4]. Microarrays, tools that produce a transcriptome (a rapid and systematic profile of the expression of a large number of genes at once), are often used to identify cancerous cells [1]. However, prior research has relied on "black-box" algorithms, which are not appropriate for use in the life sciences [3].

In this study, a novel three-step framework was developed that combines the principles of biostatistics with transparent machine learning to create mathematical equations that predict cancer diagnoses from gene expression levels. First, an XGBoost model is trained on the training set, and the features with nonzero feature importances are carried over to the next step. Second, only genes that show a statistically significant difference (α=0.05) between expression patterns in cancerous and non-cancerous samples are retained. Finally, a novel symbolic regression-based algorithm called the QLattice (short for 'Quantum Lattice') is trained on the remaining features for 10 epochs using the Akaike Information Criterion as its loss function [2].

Table 1: Performance and Identified Biomarkers by Cancer

To evaluate its performance, the framework was trained and tested on three datasets containing transcriptome profiles from cancerous and non-cancerous tissue for three different cancer types: acute myeloid leukemia (AML), non-small cell lung cancer (NSCLC), and clear cell renal cell carcinoma (ccRCC). Table 1 shows the accuracies attained for each type as well as the biomarkers used in the mathematical expression (which together serve as a predictive gene signature), where an asterisk indicates that the gene has not been associated with that cancer type in previous literature. Only three or four genes' expression levels are used in each case, while prior work has tended to use hundreds [1].
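
The significance filter (α=0.05) and the Akaike Information Criterion mentioned above can be illustrated with a minimal, standard-library-only sketch. It is not the author's implementation. The abstract does not name the statistical test, so a two-sided permutation test on the difference in means is assumed here. The gene names, expression values, and function names are hypothetical. The XGBoost and QLattice steps are not reproduced: `candidates` stands in for the genes that would survive the feature-importance step.

```python
import math
import random
from statistics import mean


def permutation_p_value(cancer, normal, n_permutations=2000, seed=0):
    """Two-sided permutation test on the difference in mean expression.

    Assumption: the source only states alpha = 0.05, not which test was used.
    """
    rng = random.Random(seed)
    observed = abs(mean(cancer) - mean(normal))
    pooled = list(cancer) + list(normal)
    n_cancer = len(cancer)
    hits = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_cancer]) - mean(pooled[n_cancer:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimated p-value strictly above zero.
    return (hits + 1) / (n_permutations + 1)


def filter_significant_genes(expression, labels, candidates, alpha=0.05):
    """Keep candidate genes whose expression differs between the two classes."""
    kept = []
    for gene in candidates:
        values = expression[gene]
        cancer = [v for v, y in zip(values, labels) if y == 1]
        normal = [v for v, y in zip(values, labels) if y == 0]
        if permutation_p_value(cancer, normal) < alpha:
            kept.append(gene)
    return kept


def aic(labels, probabilities, n_parameters):
    """Akaike Information Criterion, 2k - 2 ln(L), for a binary classifier."""
    log_likelihood = sum(
        math.log(p) if y == 1 else math.log(1.0 - p)
        for y, p in zip(labels, probabilities)
    )
    return 2 * n_parameters - 2 * log_likelihood


# Hypothetical toy data: 1 = cancerous sample, 0 = non-cancerous sample.
labels = [1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
expression = {
    "GENE_A": [9.1, 8.7, 9.4, 8.9, 9.0, 9.3, 5.2, 5.0, 4.8, 5.5, 5.1, 4.9],
    "GENE_B": [5.0, 5.3, 4.9, 5.1, 5.2, 4.8, 5.1, 4.9, 5.2, 5.0, 4.8, 5.3],
}
kept = filter_significant_genes(expression, labels, ["GENE_A", "GENE_B"])
print(kept)
```

In this toy example `GENE_A` separates the two classes and is retained, while `GENE_B` has the same distribution in both and is dropped. The `aic` helper shows why the criterion favors short expressions: each additional parameter adds 2 to the score, so it must improve the log-likelihood by more than 1 to pay for itself.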