TeaTFactor: A Prediction Tool for Tea Plant Transcription Factors Based on BERT

IF 3.4 · Zone 3 (Biology) · Q2 Biochemical Research Methods

IEEE/ACM Transactions on Computational Biology and Bioinformatics · Pub Date: 2024-08-16 · DOI: 10.1109/TCBB.2024.3444466

Qinan Tang;Ying Xiang;Wanling Gao;Liqiang Zhu;Zishu Xu;Yeyun Li;Zhenyu Yue

Volume 21, Issue 6, pp. 2123-2132 · Journal Article · IEEE Xplore: https://ieeexplore.ieee.org/document/10637723/

Citations: 0

Abstract

A transcription factor (TF) is a sequence-specific DNA-binding protein that plays a key role in cell-fate decisions by regulating gene expression. Predicting TFs is important for the tea plant research community because they regulate gene expression and thereby influence plant growth, development, and stress responses. Identifying them through wet-lab experimental validation is challenging because of their rarity and the high cost and time required. As a result, computational methods are an increasingly popular choice. The pre-training strategy has been applied to many natural language processing (NLP) tasks and has achieved impressive performance. In this paper, we present a novel recognition algorithm named TeaTFactor that uses pre-training to train a TF prediction model. The model is built on the BERT architecture and is first pre-trained on protein data from UniProt. It is then fine-tuned on TF data collected from tea plants. We evaluated four different word segmentation methods and compared against existing state-of-the-art prediction tools. According to comprehensive experimental results and a case study, our model outperforms existing models and achieves accurate identification. In addition, we have developed a web server at http://teatfactor.tlds.cc, which we believe will facilitate future studies on tea transcription factors and advance the field of crop synthetic biology.
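The abstract mentions four word segmentation methods but does not name them. As a rough illustration only, the sketch below shows one common way to turn a protein sequence into "words" for a BERT-style model: splitting it into k-mers, either overlapping or non-overlapping. The function name `segment_protein`, its parameters, and the example sequence are hypothetical and are not taken from TeaTFactor.

```python
def segment_protein(sequence: str, k: int = 3, overlapping: bool = True) -> list[str]:
    """Split an amino-acid sequence into k-mer 'words'.

    Illustrative assumption: k-mer splitting is a common segmentation scheme
    for protein language models; it is not confirmed to be one of the four
    methods evaluated in the paper.
    """
    if k < 1:
        raise ValueError("k must be a positive integer")
    # Overlapping windows advance one residue at a time;
    # non-overlapping windows advance k residues at a time.
    step = 1 if overlapping else k
    # Overlapping mode stops at the last full window;
    # non-overlapping mode keeps a shorter trailing fragment.
    last_start = len(sequence) - k + 1 if overlapping else len(sequence)
    return [sequence[i:i + k] for i in range(0, max(last_start, 0), step)]


# Hypothetical five-residue sequence, used only to show the two modes.
tokens_overlapping = segment_protein("MKVLA", k=3, overlapping=True)
tokens_disjoint = segment_protein("MKVLA", k=3, overlapping=False)
```

With `k=1` the same function reduces to single-residue tokens, which is another frequently used segmentation for protein sequences.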




Source journal

IEEE/ACM Transactions on Computational Biology and Bioinformatics (Engineering & Technology - Computer Science: Interdisciplinary Applications)

CiteScore: 7.50
Self-citation rate: 6.70%
Articles published: 479
Review time: 3 months

About the journal: IEEE/ACM Transactions on Computational Biology and Bioinformatics emphasizes the algorithmic, mathematical, statistical and computational methods that are central in bioinformatics and computational biology; the development and testing of effective computer programs in bioinformatics; the development of biological databases; important biological results that are obtained from the use of these methods, programs and databases; and the emerging field of Systems Biology, where many forms of data are used to create a computer-based model of a complex biological system.