PathoLM: Identifying pathogenicity from the DNA sequence through the Genome Foundation Model

arXiv - QuanBio - Genomics. Pub Date: 2024-06-19. DOI: arxiv-2406.13133

Sajib Acharjee Dip, Uddip Acharjee Shuvo, Tran Chau, Haoqiu Song, Petra Choi, Xuan Wang, Liqing Zhang


Citations: 0

Abstract

Pathogen identification is pivotal in diagnosing, treating, and preventing diseases, and is crucial for controlling infections and safeguarding public health. Traditional alignment-based methods, though widely used, are computationally intensive and reliant on extensive reference databases, often failing to detect novel pathogens due to their low sensitivity and specificity. Similarly, conventional machine learning techniques, while promising, require large annotated datasets and extensive feature engineering and are prone to overfitting. Addressing these challenges, we introduce PathoLM, a cutting-edge pathogen language model optimized for the identification of pathogenicity in bacterial and viral sequences. Leveraging the strengths of pre-trained DNA models such as the Nucleotide Transformer, PathoLM requires minimal data for fine-tuning, thereby enhancing pathogen detection capabilities. It effectively captures a broader genomic context, significantly improving the identification of novel and divergent pathogens. We developed a comprehensive dataset comprising approximately 30 species of viruses and bacteria, including the ESKAPEE pathogens, seven notably virulent bacterial strains resistant to antibiotics. Additionally, we curated a species classification dataset centered specifically on the ESKAPEE group. In comparative assessments, PathoLM dramatically outperforms existing models like DciPatho, demonstrating robust zero-shot and few-shot capabilities. Furthermore, we expanded PathoLM-Sp for ESKAPEE species classification, where it showed superior performance compared to other advanced deep learning methods, despite the complexities of the task.
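The abstract compares methods in terms of sensitivity and specificity. As a minimal sketch of how those two quantities are computed for a binary pathogenic / non-pathogenic classifier, the example below uses made-up labels; the function name `sensitivity_specificity` and the label lists are assumptions for illustration and are not taken from the PathoLM paper or its code.

```python
from typing import Sequence, Tuple


def sensitivity_specificity(
    y_true: Sequence[int], y_pred: Sequence[int]
) -> Tuple[float, float]:
    """Return (sensitivity, specificity) for binary labels.

    Label 1 means "pathogenic", label 0 means "non-pathogenic".
    Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
    """
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # pathogens caught
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # pathogens missed
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # benign kept benign
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # benign flagged

    # Guard against an empty class; NaN signals "undefined" here.
    sensitivity = tp / (tp + fn) if (tp + fn) else float("nan")
    specificity = tn / (tn + fp) if (tn + fp) else float("nan")
    return sensitivity, specificity


# Hypothetical ground truth and predictions for eight sequences.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
y_pred = [1, 0, 1, 0, 0, 1, 0, 1]

sens, spec = sensitivity_specificity(y_true, y_pred)
print(f"sensitivity={sens:.2f} specificity={spec:.2f}")
# prints: sensitivity=0.75 specificity=0.75
```

A method that misses novel pathogens shows up here as a large false-negative count, which lowers sensitivity even when specificity stays high.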
