Dinithi V Wanniarachchi, Sameera Viswakula, Anushka M Wickramasuriya
The evaluation of transcription factor binding site prediction tools in human and Arabidopsis genomes.
BMC Bioinformatics 25(1):371, published 2024-12-02. DOI: https://doi.org/10.1186/s12859-024-05995-0
Abstract
Background: The precise prediction of transcription factor binding sites (TFBSs) is pivotal for unraveling the gene regulatory networks underlying biological processes. While numerous tools have emerged for in silico TFBS prediction in recent years, the evolving landscape of computational biology necessitates thorough assessments of tool performance to ensure accuracy and reliability. Only a limited number of studies have been conducted to evaluate the performance of TFBS prediction tools comprehensively. Thus, the present study focused on assessing twelve widely used TFBS prediction tools and four de novo motif discovery tools using a benchmark dataset comprising real, generic, Markov, and negative sequences. TFBSs of Arabidopsis thaliana and Homo sapiens genomes downloaded from the JASPAR database were implanted in these sequences and the performance of tools was evaluated using several statistical parameters at different overlap percentages between the lengths of known and predicted binding sites.
Results: Overall, the Multiple Cluster Alignment and Search Tool (MCAST) emerged as the best TFBS prediction tool, followed by Find Individual Motif Occurrences (FIMO) and MOtif Occurrence Detection Suite (MOODS). In addition, MotEvo and Dinucleotide Weight Tensor Toolbox (DWT-toolbox) demonstrated the highest sensitivity in identifying TFBSs at 90% and 80% overlap. Further, MCAST and DWT-toolbox demonstrated the highest sensitivity across all three data types: real, generic, and Markov. Among the de novo motif discovery tools, the Multiple Em for Motif Elicitation (MEME) emerged as the best performer. An analysis of the promoter regions of genes involved in the anthocyanin biosynthesis pathway in plants and the pentose phosphate pathway in humans, using the three best-performing tools, revealed considerable variation among the top 20 motifs identified by these tools.
Conclusion: The findings of this study lay a robust groundwork for selecting optimal TFBS prediction tools for future research. Given the variability observed in tool performance, employing multiple tools for identifying TFBSs in a set of sequences is highly recommended. In addition, further studies are recommended to develop an integrated toolbox that incorporates TFBS prediction or motif discovery tools, with the aim of improving the precision and accuracy of results.
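The overlap-based evaluation described in the Background can be illustrated with a minimal Python sketch. It is not the authors' code: the abstract does not state how overlap is computed, so the sketch assumes overlap is measured as the fraction of the known site's length covered by a predicted site, and it uses the standard definition of sensitivity, TP / (TP + FN). The function names `overlap_fraction` and `sensitivity` and the coordinates are hypothetical.

```python
from typing import List, Tuple

# A site is a half-open interval [start, end) on one sequence (assumed convention).
Site = Tuple[int, int]


def overlap_fraction(known: Site, predicted: Site) -> float:
    """Fraction of the known site's length covered by the predicted site."""
    start = max(known[0], predicted[0])
    end = min(known[1], predicted[1])
    covered = max(0, end - start)  # zero when the intervals do not intersect
    return covered / (known[1] - known[0])


def sensitivity(known_sites: List[Site],
                predicted_sites: List[Site],
                min_overlap: float) -> float:
    """TP / (TP + FN): share of known sites recovered at the given overlap threshold."""
    if not known_sites:
        return 0.0
    true_positives = sum(
        1
        for k in known_sites
        if any(overlap_fraction(k, p) >= min_overlap for p in predicted_sites)
    )
    return true_positives / len(known_sites)


# Hypothetical implanted sites and tool predictions on a single sequence.
known = [(100, 110), (200, 212)]
predicted = [(101, 111), (205, 217), (300, 310)]

for threshold in (0.9, 0.8, 0.5):
    print(threshold, sensitivity(known, predicted, threshold))
```

In this toy case the first known site is covered at 9 of 10 positions and the second at 7 of 12, so a stricter threshold counts fewer known sites as recovered. This shows why a tool's sensitivity depends on the overlap percentage chosen for the evaluation.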
About the journal:
BMC Bioinformatics is an open access, peer-reviewed journal that considers articles on all aspects of the development, testing and novel application of computational and statistical methods for the modeling and analysis of all kinds of biological data, as well as other areas of computational biology.
BMC Bioinformatics is part of the BMC series which publishes subject-specific journals focused on the needs of individual research communities across all areas of biology and medicine. We offer an efficient, fair and friendly peer review service, and are committed to publishing all sound science, provided that there is some advance in knowledge presented by the work.