GATA TF Class Classifier: AI-based functional prediction and taxonomic profiling in angiosperm GATA transcription factors
Author: Mangi Kim
Journal: Biosystems, volume 257, Article 105589 (impact factor 1.9; JCR Q2, Biology)
Publication date: 2025-09-09
Publication type: Journal Article
DOI: 10.1016/j.biosystems.2025.105589
URL: https://www.sciencedirect.com/science/article/pii/S0303264725001996
Citation count: 0
Abstract
GATA transcription factors (TFs) are key regulators of diverse physiological and developmental processes in angiosperms. Although they are traditionally classified into four functional classes (A–D) based on phylogenetic relationships, large-scale classification across plant genomes remains limited by the labor-intensive nature of tree-based approaches. To overcome this limitation, this study presents the GATA TF Class Classifier, a scalable sequence-based tool for genome-wide functional classification of GATA TFs across angiosperm species. The model was trained on 700 curated full-length sequences from 23 species. Sequences were encoded with ProtBERT, the embeddings were reduced via principal component analysis (PCA) and combined with six additional features, and a support vector machine (SVM) assigned each sequence to a functional class. The model achieved an average accuracy of 94.29%, with balanced performance across all classes, as confirmed by repeated stratified 5-fold cross-validation. When applied to 4170 GATA TFs from 121 angiosperm genomes, the classifier showed that classes A and B were relatively abundant, whereas classes C and D were less represented, implying that each class may perform distinct biological functions. In addition, this study performed a taxonomic analysis of the predicted GATA TF classes to investigate their characteristics across major angiosperm lineages. Taken together, the classifier facilitates large-scale annotation and offers insights into the lineage-specific diversification and functional evolution of GATA TFs in angiosperms.
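The evaluation protocol named in the abstract, repeated stratified 5-fold cross-validation, can be illustrated with a minimal standard-library sketch. Everything specific here is an assumption made for illustration: the toy two-dimensional features, the class sizes, and the nearest-centroid classifier are stand-ins, and the ProtBERT encoding, PCA step, six additional features, and SVM of the study are not reproduced. The sketch only shows how each fold keeps the class proportions of the full set and how per-fold accuracies are averaged into a single figure.

```python
import random
from collections import defaultdict
from statistics import mean


def stratified_folds(labels, k, rng):
    """Split sample indices into k folds, preserving class proportions."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds = [[] for _ in range(k)]
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        # Deal indices round-robin so every fold gets a near-equal share of the class.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def centroid_fit(X, y):
    """Toy stand-in classifier (NOT the paper's SVM): one mean vector per class."""
    sums, counts = {}, defaultdict(int)
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, value in enumerate(row):
            acc[j] += value
        counts[label] += 1
    return {c: [s / counts[c] for s in sums[c]] for c in sums}


def centroid_predict(centroids, row):
    """Return the class whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda c: sum((a - b) ** 2 for a, b in zip(row, centroids[c])),
    )


def repeated_stratified_cv(X, y, k=5, repeats=3, seed=0):
    """Return one accuracy per fold, over `repeats` independent reshuffles."""
    rng = random.Random(seed)
    scores = []
    for _ in range(repeats):
        for test_idx in stratified_folds(y, k, rng):
            held_out = set(test_idx)
            train_idx = [i for i in range(len(y)) if i not in held_out]
            model = centroid_fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            correct = sum(centroid_predict(model, X[i]) == y[i] for i in test_idx)
            scores.append(correct / len(test_idx))
    return scores


def make_toy_data(seed=0):
    """Hypothetical imbalanced 4-class data standing in for classes A-D."""
    rng = random.Random(seed)
    centers = {"A": (0, 0), "B": (6, 0), "C": (0, 6), "D": (6, 6)}
    sizes = {"A": 40, "B": 30, "C": 20, "D": 10}  # assumed, not from the paper
    X, y = [], []
    for label, (cx, cy) in centers.items():
        for _ in range(sizes[label]):
            X.append([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)])
            y.append(label)
    return X, y


X, y = make_toy_data()
scores = repeated_stratified_cv(X, y, k=5, repeats=3, seed=0)
mean_accuracy = mean(scores)
print(f"{len(scores)} folds, mean accuracy = {mean_accuracy:.4f}")
```

Stratifying matters when classes are imbalanced, as the abstract reports for classes C and D: without it, a fold could contain few or no members of a rare class, and the per-class performance estimate for that fold would be unreliable.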
About the journal:
BioSystems encourages experimental, computational, and theoretical articles that link biology, evolutionary thinking, and the information processing sciences. The link areas form a circle that encompasses the fundamental nature of biological information processing, computational modeling of complex biological systems, evolutionary models of computation, the application of biological principles to the design of novel computing systems, and the use of biomolecular materials to synthesize artificial systems that capture essential principles of natural biological information processing.