Prediction of novel mouse TLR9 agonists using a random forest approach.

IF 2.4 · Zone 3 (Biology) · JCR Q4 (Cell Biology)

BMC Molecular and Cell Biology · Pub Date: 2019-12-20 · DOI: 10.1186/s12860-019-0241-0

Varun Khanna, Lei Li, Johnson Fung, Shoba Ranganathan, Nikolai Petrovsky

BMC Molecular and Cell Biology, volume 20 Suppl 2, article 56 (journal article). https://doi.org/10.1186/s12860-019-0241-0

Citations: 4

Abstract


Background: Toll-like receptor 9 (TLR9) is a key innate immune receptor involved in detecting infectious diseases and cancer. TLR9 activates the innate immune system following the recognition of single-stranded DNA oligonucleotides (ODNs) containing unmethylated cytosine-guanine (CpG) motifs. Because ODNs contain a considerable number of rotatable bonds, high-throughput in silico screening of CpG ODNs for potential TLR9 activity via traditional structure-based virtual screening is challenging. In the current study, we present a machine-learning-based method for predicting novel mouse TLR9 (mTLR9) agonists from features including the count and position of motifs, the distance between motifs, and graphically derived features such as the radius of gyration and moment of inertia. We used an in-house, experimentally validated dataset of 396 single-stranded synthetic ODNs to compare the results of five machine learning algorithms. Since the dataset was highly imbalanced, we used an ensemble learning approach based on repeated random down-sampling.
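The motif-based features named above (count, position, and distance between motifs) can be illustrated with a minimal sketch. The function names, the choice of the `CG` motif, the feature definitions, and the example 20-mer are all assumptions made for illustration; the abstract does not specify how the authors encoded these features.

```python
def motif_positions(seq: str, motif: str) -> list[int]:
    """Return 0-based start indices of every (possibly overlapping) occurrence."""
    positions = []
    start = seq.find(motif)
    while start != -1:
        positions.append(start)
        start = seq.find(motif, start + 1)
    return positions


def motif_features(seq: str, motif: str = "CG") -> dict:
    """Hypothetical feature set: motif count, first position, mean gap between motifs."""
    pos = motif_positions(seq.upper(), motif)
    # Distance between consecutive occurrences, measured start to start.
    gaps = [b - a for a, b in zip(pos, pos[1:])]
    return {
        "count": len(pos),
        "first_position": pos[0] if pos else -1,  # -1 marks "motif absent"
        "mean_gap": sum(gaps) / len(gaps) if gaps else 0.0,
    }


# Illustrative 20-mer, not taken from the paper's dataset.
features = motif_features("TCCATGACGTTCCTGACGTT")
print(features)
```

A full feature table would repeat this for each motif of interest and append the structure-derived descriptors, which are not sketched here.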

Results: Using in-house experimental TLR9 activity data, we found that the random forest algorithm outperformed the other algorithms on our dataset for TLR9 activity prediction. Therefore, we developed a cross-validated ensemble classifier of 20 random forest models. The average Matthews correlation coefficient and balanced accuracy of our ensemble classifier on test samples were 0.61 and 80.0%, respectively, with a maximum balanced accuracy and Matthews correlation coefficient of 87.0% and 0.75, respectively. We confirmed that common sequence motifs including 'CC', 'GG', 'AG', 'CCCG' and 'CGGC' were overrepresented in mTLR9 agonists. Predictions on 6000 randomly generated ODNs were ranked, and the top 100 ODNs were synthesized and experimentally tested for activity in an mTLR9 reporter cell assay; 91 of the 100 selected ODNs showed high activity, confirming the accuracy of the model in predicting mTLR9 activity.
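For readers unfamiliar with the two reported metrics, the sketch below computes the Matthews correlation coefficient and balanced accuracy from binary confusion-matrix counts using their standard definitions. The labels and predictions are made-up toy values, not data from the study, and the zero-denominator convention is an assumption.

```python
import math


def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels (1 = active, 0 = inactive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    return tp, tn, fp, fn


def matthews_cc(tp, tn, fp, fn):
    """MCC = (TP*TN - FP*FN) / sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0  # 0.0 when undefined


def balanced_accuracy(tp, tn, fp, fn):
    """Mean of sensitivity (TP rate) and specificity (TN rate)."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return (sensitivity + specificity) / 2


# Toy labels for illustration only.
y_true = [1, 1, 1, 1, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 1]
counts = confusion_counts(y_true, y_pred)
mcc_value = matthews_cc(*counts)
bacc_value = balanced_accuracy(*counts)
print(counts, mcc_value, bacc_value)
```

Both metrics remain informative on imbalanced data, where plain accuracy can look high simply by favoring the larger class.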

Conclusion: We combined repeated random down-sampling with random forest to overcome the class imbalance problem and achieved promising results. Overall, we showed that the random forest algorithm outperformed other machine learning algorithms including support vector machines, shrinkage discriminant analysis, gradient boosting machine and neural networks. Due to its predictive performance and simplicity, the random forest technique is a useful method for prediction of mTLR9 ODN agonists.
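The repeated random down-sampling idea can be sketched as follows: each of the 20 models sees every minority-class example plus an equally sized random draw from the majority class. This is a minimal sketch under stated assumptions. The helper names, the seed, the toy class sizes, and the majority-vote rule for combining models are hypothetical; the abstract does not say which class was in the minority or how the 20 random forests were combined, and the random forest training step itself is omitted.

```python
import random


def downsample_once(majority, minority, rng):
    """Draw len(minority) majority examples without replacement, then add all minority examples."""
    sampled = rng.sample(majority, k=len(minority))
    return sampled + list(minority)


def balanced_subsets(majority, minority, n_models=20, seed=0):
    """Build one balanced training subset per ensemble member."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    return [downsample_once(majority, minority, rng) for _ in range(n_models)]


def majority_vote(votes):
    """Combine 0/1 predictions from the ensemble members; ties resolve to 0 (assumed rule)."""
    return int(sum(votes) * 2 > len(votes))


# Toy data: 50 majority-class and 10 minority-class placeholders.
majority = [("majority", i) for i in range(50)]
minority = [("minority", i) for i in range(10)]
subsets = balanced_subsets(majority, minority)
print(len(subsets), len(subsets[0]))
```

In a full pipeline, each subset would train one classifier and `majority_vote` (or an average of predicted probabilities) would produce the ensemble prediction.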


Source journal

BMC Molecular and Cell Biology (Biochemistry, Genetics and Molecular Biology: Cell Biology)

CiteScore: 5.50

Self-citation rate: 0.00%

Articles published: not listed

Review time: 27 weeks