Machine learning approaches for the identification of new driver-like genes using microarray expression profiles
L. D. Mora, O. Azofeifa, D. Diaz, J. Guevara-Coto
2019 IV Jornadas Costarricenses de Investigación en Computación e Informática (JoCICI), 2019-08-01
DOI: 10.1109/JoCICI48395.2019.9105274
Cancer is the second leading cause of death worldwide. Because of this, research efforts have generated vast amounts of data, such as gene expression profiles and cell lines, for use in knowledge discovery research. Using the expression profiles from these cell lines, it is possible to identify novel driver-like candidates by developing machine learning models. In this study, we focused on constructing a robust classifier capable of identifying new driver genes from a prediction set composed of non-coding genes. The training set was constructed from 300 known pan-cancer driver genes previously reported in the literature and ~2700 non-driver genes. We used two machine learning algorithms: random forests and support vector machines. During the construction of each model, we performed optimization (fine-tuning), which included feature selection using a random forest-based method, class balancing to compensate for the imbalanced training set, and normalization to reduce the effects of extreme values and make samples comparable. Our results indicate that the highest-performing model was the random forest, with an AUC-ROC of 0.9696. When applied to the prediction set, it identified 525 potential non-coding driver-like genes that may be associated with cancer. As a next step, we expect to functionally annotate these candidates and look for co-expression across the diverse cancer types represented in the data.
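
The workflow described in the abstract (normalization, random forest-based feature selection, class balancing, then a random forest or support vector machine scored by AUC-ROC) can be illustrated with a minimal sketch in Python using scikit-learn. This is not the authors' implementation. The data is synthetic and only mimics the roughly 1:9 driver to non-driver ratio; the choice of `StandardScaler` for normalization, `class_weight="balanced"` for class balancing, the median importance threshold, and every hyperparameter are assumptions made for illustration, since the abstract does not specify them. The function names `build_pipeline` and `compare_models` are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC


def build_pipeline(classifier):
    """Normalize, select features with a random forest, then classify."""
    return Pipeline([
        # Assumption: z-score scaling stands in for the paper's normalization.
        ("scale", StandardScaler()),
        # Keep features whose forest importance is at or above the median.
        ("select", SelectFromModel(
            RandomForestClassifier(n_estimators=100, random_state=0),
            threshold="median",
        )),
        ("clf", classifier),
    ])


def compare_models(X, y, n_splits=5):
    """Return the mean cross-validated AUC-ROC for each candidate model."""
    models = {
        # Assumption: class weighting stands in for the paper's balancing step.
        "random_forest": RandomForestClassifier(
            n_estimators=200, class_weight="balanced", random_state=0),
        "svm": SVC(kernel="rbf", class_weight="balanced"),
    }
    cv = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=0)
    return {
        name: cross_val_score(
            build_pipeline(clf), X, y, cv=cv, scoring="roc_auc").mean()
        for name, clf in models.items()
    }


if __name__ == "__main__":
    # Synthetic stand-in for expression profiles: about 10% positive class.
    X, y = make_classification(
        n_samples=600, n_features=40, n_informative=8,
        weights=[0.9, 0.1], random_state=0)
    for name, auc in compare_models(X, y).items():
        print(f"{name}: mean AUC-ROC = {auc:.3f}")
```

Placing scaling and feature selection inside the pipeline means both are refit on each training fold, so the held-out fold does not leak into the selection step. Scores on the synthetic data say nothing about the reported 0.9696; they only show how such a comparison could be set up.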