A hybrid approach for predicting transcription factors.

IF 2.8 Q2 MATHEMATICAL & COMPUTATIONAL BIOLOGY

Frontiers in Bioinformatics. Pub Date: 2024-07-25. eCollection Date: 2024-01-01. DOI: 10.3389/fbinf.2024.1425419

Sumeet Patiyal, Palak Tiwari, Mohit Ghai, Aman Dhapola, Anjali Dhall, Gajendra P S Raghava



Abstract

Transcription factors are essential DNA-binding proteins that regulate the transcription rate of genes and thereby control gene expression inside a cell. Predicting transcription factors with high precision is important for understanding biological processes such as cell differentiation, intracellular signaling, and cell-cycle control. In this study, we developed a hybrid method that combines alignment-based and alignment-free methods to predict transcription factors with higher accuracy. All models were trained, tested, and evaluated on a large dataset containing 19,406 transcription factors and 523,560 non-transcription factor protein sequences. To avoid bias in evaluation, the dataset was divided into training and independent validation sets: 80% of the data was used for training and the remaining 20% for external validation. For the alignment-free method, models were developed using machine learning techniques and composition-based features of proteins. Our best alignment-free model obtained an AUC of 0.97 on the independent dataset. For the alignment-based method, we used BLAST at different cut-offs to predict transcription factors. Although the alignment-based method performed excellently, it could not cover all transcription factors because some queries returned no hits. To combine the strengths of both, the hybrid method adds the scores of the alignment-free and alignment-based methods, achieving a maximum AUC of 0.99 on the independent dataset. The proposed method performs better than existing methods. The best models are available through the web server, Python Package Index, and standalone package of "TransFacPred" (https://webs.iiitd.edu.in/raghava/transfacpred).
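The score addition described in the abstract can be illustrated with a minimal Python sketch. It is not the TransFacPred implementation. The abstract does not specify which composition features, classifier, or BLAST score weights were used, so the function names, the plain amino acid composition feature, and the +0.5 / -0.5 / 0 BLAST weights below are all assumptions made for illustration.

```python
from collections import Counter

# The 20 standard amino acids, in a fixed order for the feature vector.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def amino_acid_composition(sequence):
    """Return the percentage of each standard residue (a 20-value list).

    This is one simple composition-based feature; the study may have used
    other or additional composition features.
    """
    counts = Counter(sequence.upper())
    total = sum(counts[aa] for aa in AMINO_ACIDS)
    if total == 0:
        raise ValueError("sequence contains no standard amino acids")
    return [100.0 * counts[aa] / total for aa in AMINO_ACIDS]


def blast_score(top_hit_label, weight=0.5):
    """Convert the top BLAST hit into a score.

    top_hit_label is "TF", "non-TF", or None when BLAST returns no hit.
    The weight of 0.5 is a hypothetical value, not taken from the paper.
    """
    if top_hit_label is None:
        return 0.0  # no hit: the alignment-based method contributes nothing
    return weight if top_hit_label == "TF" else -weight


def hybrid_score(ml_probability, top_hit_label):
    """Add the alignment-free (ML) score and the alignment-based score."""
    return ml_probability + blast_score(top_hit_label)
```

With this scheme a sequence with no BLAST hit falls back on the machine learning score alone, which is how a hybrid can cover the cases the alignment-based method misses. For example, `hybrid_score(0.4, None)` leaves the classifier's score unchanged, while a hit against a known transcription factor raises it.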

