A novel three-step transcriptomic framework for cancer prediction

Rushank Goyal
Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2022-08-07. DOI: 10.1145/3535508.3545098 (https://doi.org/10.1145/3535508.3545098)
Citations: 0

Abstract

Cancer is a broad term for diseases characterized by uncontrollable and abnormal cell growth. With 19.3 million new cases and 10 million cancer-related deaths per annum, it is the second-leading cause of death worldwide [4]. Microarrays, tools that produce a transcriptome (a rapid and systematic profile of the expression of a large number of genes at once), are often used to identify cancerous cells [1]. However, prior research has relied on "black-box" algorithms, which are not appropriate for use in the life sciences [3]. In this study, a novel three-step framework was developed that combines the principles of biostatistics with transparent machine learning to create mathematical equations that predict cancer diagnoses from gene expression levels. First, an XGBoost model is trained on the training set, and the features with nonzero feature importances are carried on to the next step. Second, only genes that show a statistically significant difference (α=0.05) between expression patterns in cancerous and non-cancerous samples are retained. Finally, a novel symbolic regression-based algorithm called the QLattice (short for 'Quantum Lattice') is trained on the remaining features for 10 epochs using the Akaike Information Criterion as its loss function [2]. To evaluate its performance, the framework was trained and tested on three datasets containing transcriptome profiles from cancerous and non-cancerous tissue for three different cancer types: acute myeloid leukemia (AML), non-small cell lung cancer (NSCLC), and clear cell renal cell carcinoma (ccRCC). Table 1 (Performance and Identified Biomarkers by Cancer) shows the accuracies attained for each type as well as the biomarkers used in the mathematical expression (which together serve as a predictive gene signature), where an asterisk indicates that the gene has not been associated with that cancer type in previous literature. Only three or four genes' expression levels are used in each case, while prior work has tended to use hundreds [1].
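
The second step and the scoring criterion of the third step can be illustrated with a minimal, standard-library-only sketch. It is not the author's implementation. The abstract does not say which statistical test was used, so the Mann-Whitney U test (normal approximation, no tie correction) is an assumption here, as are the function names, the dictionary layout of the expression data, and the use of a Bernoulli log-likelihood inside the AIC. The XGBoost screening step and the QLattice search itself are not reproduced, since both depend on third-party libraries.

```python
import math
from statistics import NormalDist


def mann_whitney_p(a, b):
    """Two-sided Mann-Whitney U p-value (normal approximation, no tie correction).

    Assumption: the source only states alpha = 0.05, not the test that was used.
    """
    n1, n2 = len(a), len(b)
    if n1 == 0 or n2 == 0:
        raise ValueError("both groups need at least one sample")
    pooled = sorted([(v, 0) for v in a] + [(v, 1) for v in b])
    ranks = [0.0] * len(pooled)
    i = 0
    while i < len(pooled):
        j = i
        # Extend j over a run of tied values so they share an average rank.
        while j + 1 < len(pooled) and pooled[j + 1][0] == pooled[i][0]:
            j += 1
        for k in range(i, j + 1):
            ranks[k] = (i + j) / 2 + 1
        i = j + 1
    rank_sum_a = sum(r for r, (_, group) in zip(ranks, pooled) if group == 0)
    u = rank_sum_a - n1 * (n1 + 1) / 2
    mean_u = n1 * n2 / 2
    sd_u = math.sqrt(n1 * n2 * (n1 + n2 + 1) / 12)
    z = (u - mean_u) / sd_u
    return 2 * (1 - NormalDist().cdf(abs(z)))


def significant_genes(expr, labels, alpha=0.05):
    """Keep genes whose expression differs between cancer (1) and normal (0).

    expr is a hypothetical layout: gene name -> one value per sample.
    """
    kept = []
    for gene, values in expr.items():
        cancer = [v for v, y in zip(values, labels) if y == 1]
        normal = [v for v, y in zip(values, labels) if y == 0]
        if mann_whitney_p(cancer, normal) < alpha:
            kept.append(gene)
    return kept


def aic(y_true, p_pred, n_params):
    """Akaike Information Criterion, 2k - 2 ln L, with a Bernoulli likelihood."""
    eps = 1e-12  # clip probabilities away from 0 and 1 to avoid log(0)
    log_lik = sum(
        y * math.log(max(p, eps)) + (1 - y) * math.log(max(1 - p, eps))
        for y, p in zip(y_true, p_pred)
    )
    return 2 * n_params - 2 * log_lik


# Toy data with hypothetical gene names: 8 cancer samples, then 8 normal samples.
labels = [1] * 8 + [0] * 8
expr = {
    "GENE_A": [10, 11, 12, 13, 14, 15, 16, 17, 1, 2, 3, 4, 5, 6, 7, 8],
    "GENE_B": [1, 3, 5, 7, 9, 11, 13, 15, 2, 4, 6, 8, 10, 12, 14, 16],
}
print(significant_genes(expr, labels))
```

In this toy data GENE_A separates the two groups completely and passes the filter, while GENE_B is interleaved between the groups and is dropped. Because the AIC adds a penalty of 2 per fitted parameter, a candidate equation with more parameters has to improve the log-likelihood enough to justify them, which fits the abstract's emphasis on short expressions over three or four genes.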