Brendon K Myers, Anuj Lamichhane, Brian H Kvitko, Bhabesh Dutta
NLP-like deep learning aided in identification and validation of thiosulfinate tolerance clusters in diverse bacteria.
mSphere, e0002325. Published 2025-07-29 (Epub 2025-06-17). DOI: https://doi.org/10.1128/msphere.00023-25
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12306174/pdf/
Abstract
Allicin tolerance (alt) clusters in phytopathogenic bacteria, which provide resistance to thiosulfinates like allicin, are challenging to find using conventional approaches because of their varied architecture and the paradox of being vertically maintained within genera despite likely being horizontally transferred. This results in significant sequence diversity that further complicates their identification. Natural language processing (NLP)-like techniques, such as those used in DeepBGC, offer a promising solution by treating gene clusters like a language, which allows gene clusters to be identified and collected based on patterns and relationships within the sequences. We curated and validated alt-like clusters in Pantoea ananatis 97-1R, Burkholderia gladioli pv. gladioli FDAARGOS 389, and Pseudomonas syringae pv. tomato DC3000. Leveraging sequences from the RefSeq bacterial database, we conducted comparative analyses of gene synteny, gene/protein sequences, protein structures, and predicted protein interactions. This approach enabled the discovery of several novel alt-like clusters previously undetectable by other methods, which were further validated experimentally. Our work highlights the effectiveness of NLP-like techniques for identifying underrepresented gene clusters and expands our understanding of the diversity and utility of alt-like clusters in diverse bacterial genera. This work demonstrates the potential of these techniques to simplify the identification process and enhance the applicability of biological data in real-world scenarios.

IMPORTANCE
Thiosulfinates, like allicin, are potent antifeedants and antimicrobials produced by Allium species and pose a challenge for phytopathogenic bacteria. Phytopathogenic bacteria have been shown to utilize an allicin tolerance (alt) gene cluster to circumvent this host response, leading to economically significant yield losses. Because these clusters are difficult to mine, we applied techniques akin to natural language processing to analyze Pfam domains and gene proximity. This approach led to the identification of novel alt-like gene clusters, showcasing the potential of artificial intelligence to reveal elusive and underrepresented genetic clusters and enhance our understanding of their diversity and role across various bacterial genera.
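The idea of reading a genome as a sequence of Pfam-domain "words" and searching gene neighborhoods for a familiar "phrase" can be illustrated with a small sketch. This is not DeepBGC and not the authors' pipeline: it swaps the learned model for a plain Jaccard overlap between the domain set of a sliding gene window and a reference cluster's domain set. The function names and the Pfam identifiers (PF_A, PF_B, ...) are hypothetical placeholders, not real accessions or data from the paper.

```python
from typing import List, Set, Tuple

# Each gene is represented by the set of Pfam domain tokens found on it.
# A genome (or contig) is an ordered list of genes, so order encodes proximity.
Genome = List[Set[str]]


def window_tokens(genes: Genome, start: int, size: int) -> Set[str]:
    """Collect every domain token seen in `size` consecutive genes."""
    tokens: Set[str] = set()
    for gene in genes[start:start + size]:
        tokens |= gene
    return tokens


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity of two token sets (0.0 when both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def scan(genes: Genome, reference: Set[str], size: int,
         threshold: float) -> List[Tuple[int, float]]:
    """Slide a window over the gene list and report (start index, score)
    for every window whose domain set resembles the reference cluster."""
    hits: List[Tuple[int, float]] = []
    last_start = max(len(genes) - size, 0)
    for start in range(last_start + 1):
        score = jaccard(window_tokens(genes, start, size), reference)
        if score >= threshold:
            hits.append((start, score))
    return hits


if __name__ == "__main__":
    # Placeholder tokens only; these are NOT real Pfam accessions.
    reference_cluster = {"PF_A", "PF_B", "PF_C", "PF_D"}
    contig: Genome = [
        {"PF_X"}, {"PF_A"}, {"PF_B", "PF_C"}, {"PF_D"}, {"PF_Y"}, {"PF_Z"},
    ]
    for start, score in scan(contig, reference_cluster, size=3, threshold=0.7):
        print(f"candidate window at gene {start}, similarity {score:.2f}")
```

A set-overlap score ignores gene order and domain context inside the window, which is exactly what a sequence model is meant to capture. The sketch only shows why domain content plus gene proximity can find clusters whose protein sequences have diverged too far for sequence-similarity searches.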
About the journal:
mSphere™ is a multi-disciplinary open-access journal that will focus on rapid publication of fundamental contributions to our understanding of microbiology. Its scope will reflect the immense range of fields within the microbial sciences, creating new opportunities for researchers to share findings that are transforming our understanding of human health and disease, ecosystems, neuroscience, agriculture, energy production, climate change, evolution, biogeochemical cycling, and food and drug production. Submissions will be encouraged of all high-quality work that makes fundamental contributions to our understanding of microbiology. mSphere™ will provide streamlined decisions, while carrying on ASM's tradition for rigorous peer review.