Machine learning approaches for the identification of new driver-like genes using microarray expression profiles

L. D. Mora, O. Azofeifa, D. Diaz, J. Guevara-Coto
Published in: 2019 IV Jornadas Costarricenses de Investigación en Computación e Informática (JoCICI)
Publication date: 2019-08-01
DOI: 10.1109/JoCICI48395.2019.9105274
URL: https://doi.org/10.1109/JoCICI48395.2019.9105274
Citations: 0

Abstract

Cancer is the second leading cause of death worldwide. As a result, research efforts have generated vast amounts of data, such as gene expression profiles and cell lines, for use in knowledge discovery research. By using the expression profiles from these cell lines, machine learning models can be developed to identify novel driver-like candidate genes. In this study, we focused on constructing a robust classifier capable of identifying new driver genes in a prediction set composed of non-coding genes. The training set comprised 300 known pan-cancer driver genes previously reported in the literature and ~2700 non-driver genes. We used two machine learning algorithms: random forests and support vector machines. Each model was optimized during construction through feature selection with a random forest-based method, class balancing to compensate for the imbalanced training set, and normalization to reduce the effect of extreme values and make samples comparable. The best-performing model was the random forest, with an AUC-ROC of 0.9696. When applied to the prediction set, it identified 525 potential non-coding driver-like genes with a possible association with cancer. As a next step, we expect to functionally annotate these candidates and look for co-expression across the diversity of cancers represented in the data.
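
The abstract mentions class balancing and the AUC-ROC metric without giving implementation details. The sketch below is only an illustration of those two ideas in plain Python, not the authors' code. It assumes an inverse-frequency weighting heuristic for balancing (the paper does not say which balancing method was used) and computes AUC-ROC through the rank-sum (Mann-Whitney) formulation. The function names `balanced_class_weights` and `auc_roc`, the label encoding (1 = driver, 0 = non-driver), and the exact count of 2700 non-drivers are hypothetical choices made for the example.

```python
def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count).

    Assumed heuristic for illustration; the abstract does not specify
    how the classes were balanced.
    """
    counts = {}
    for y in labels:
        counts[y] = counts.get(y, 0) + 1
    n_samples = len(labels)
    n_classes = len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a randomly chosen positive (1)
    receives a higher score than a randomly chosen negative (0).
    Tied scores share an average rank, so a tie counts as 0.5.
    """
    pairs = sorted(zip(scores, labels))
    n = len(pairs)
    ranks = [0.0] * n
    i = 0
    while i < n:
        # Find the run of tied scores starting at position i.
        j = i
        while j + 1 < n and pairs[j + 1][0] == pairs[i][0]:
            j += 1
        avg_rank = (i + j) / 2 + 1  # 1-based average rank of the run
        for k in range(i, j + 1):
            ranks[k] = avg_rank
        i = j + 1

    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = n - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC-ROC needs at least one positive and one negative")

    rank_sum_pos = sum(r for r, (_, y) in zip(ranks, pairs) if y == 1)
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2
    return u_stat / (n_pos * n_neg)


# Illustrative training-set composition: 300 drivers, 2700 non-drivers
# (the abstract gives the non-driver count only approximately).
train_labels = [1] * 300 + [0] * 2700
weights = balanced_class_weights(train_labels)

# Toy scores, unrelated to the paper's data.
toy_auc = auc_roc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8])
```

With roughly a 1:9 class ratio, this heuristic weights each driver gene about nine times more heavily than each non-driver gene, which is one way to keep a classifier from favouring the majority class. An AUC-ROC near 1, such as the reported 0.9696, means that a known driver almost always receives a higher score than a non-driver.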