Shengchun Qi, Shuyan Wang, Yu Xia, Songcan Chen, Huijie Lu
Eco-Environment & Health, 4(3), 100171. Published 2025-07-24. DOI: 10.1016/j.eehl.2025.100171. Full text: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12355066/pdf/
Identification of human pathogens in soil by virulence gene-based machine learning method.
Soils are important reservoirs of human pathogenic bacteria that can spread to humans through various pathways. Metagenomics enables high-throughput pathogen identification by mapping sequencing reads to known pathogen genomes. However, this approach has several limitations, e.g., sequence assembly is time-consuming, and reliance on reference databases may overlook potential pathogens lacking close genomic matches. Here, we developed a novel, virulence factor (VF) based machine learning method using the K-Nearest Neighbors model (VF-KNN) for identifying human pathogenic bacteria from soil metagenomes. Through learning the VF features of pathogenic and non-pathogenic bacteria, VF-KNN could achieve the desired performance in soil pathogen identification (AUC: 0.95, Accuracy: 0.85). Model prediction accuracy (0.95) was further validated using 61 pathogenic strains isolated from soil. For the top 15 most frequent soil pathogens, the prediction accuracy was >0.90 at 0.4X-1.0X genome coverage. VFs contributing significantly to pathogen identification were associated with regulation, effector delivery, motility, etc. By using VF-KNN, the averaged abundance of total potential pathogens in topsoils across China was 0.44% (n = 336), predominantly concentrated in the eastern coastal provinces. Compared with the conventional method based on a predefined pathogen list, VF-KNN identified 28% more potential pathogenic species, including some newly reported but not in the predefined list (e.g., Mycolicibacterium cosmeticum). Agricultural land exhibited significantly higher pathogen abundance and diversity than the other land types. This newly developed VF-KNN method is applicable for pathogen identification in broader environments.
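The classification idea in the abstract can be illustrated with a minimal sketch. The abstract does not state how VF features are encoded, which distance metric is used, or the value of k, so everything here is an assumption: genomes are represented as binary presence/absence vectors over four made-up VF categories, distance is Hamming distance, k is 3, and the toy training data is hypothetical. This is not the authors' VF-KNN implementation.

```python
from collections import Counter


def hamming(a, b):
    """Count the positions where two equal-length binary vectors differ."""
    return sum(x != y for x, y in zip(a, b))


def knn_predict(train, query, k=3):
    """Label `query` by majority vote among its k nearest training vectors."""
    # sorted() is stable, so equidistant neighbors keep their training order.
    neighbors = sorted(train, key=lambda item: hamming(item[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical toy data (assumption, not from the paper). Each column marks
# presence (1) or absence (0) of one illustrative VF category:
# regulation, effector delivery, motility, adherence.
TRAIN = [
    ((1, 1, 1, 0), "pathogenic"),
    ((1, 1, 0, 1), "pathogenic"),
    ((1, 0, 1, 1), "pathogenic"),
    ((0, 0, 1, 0), "non-pathogenic"),
    ((0, 0, 0, 0), "non-pathogenic"),
    ((0, 1, 0, 0), "non-pathogenic"),
]

if __name__ == "__main__":
    # Classify two hypothetical query genomes by their VF profiles.
    for profile in [(1, 1, 1, 1), (0, 1, 1, 0)]:
        print(profile, knn_predict(TRAIN, profile))
```

Because the prediction depends only on VF profiles rather than on matching a named reference genome, a classifier of this kind can in principle flag organisms absent from a predefined pathogen list, which is the advantage the abstract reports.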
About the journal:
Eco-Environment & Health (EEH) is an international, multidisciplinary, peer-reviewed journal that publishes work at the frontiers of ecology, environment, and health and their related disciplines. EEH focuses on the concept of “One Health” to promote green and sustainable development, addressing the interactions among ecology, environment, and health, together with their underlying mechanisms and interventions. Our mission is to be one of the most important flagship journals in the field of environmental health.
Scope
EEH covers a variety of research areas, including but not limited to ecology and biodiversity conservation, environmental behaviors and bioprocesses of emerging contaminants, human exposure and health effects, and evaluation, management and regulation of environmental risks. The key topics of EEH include:
1) Ecology and Biodiversity Conservation
Biodiversity
Ecological restoration
Ecological safety
Protected area
2) Environmental and Biological Fate of Emerging Contaminants
Environmental behaviors
Environmental processes
Environmental microbiology
3) Human Exposure and Health Effects
Environmental toxicology
Environmental epidemiology
Environmental health risk
Food safety
4) Evaluation, Management and Regulation of Environmental Risks
Chemical safety
Environmental policy
Health policy
Health economics
Environmental remediation