Modelling phylogeny in 16S rRNA gene sequencing datasets using string-based kernels
Authors: Jonathan Ish-Horowicz, Sarah Filippi
Journal: Journal of Theoretical Biology, volume 616, Article 112249
Published: 2025-09-12
DOI: 10.1016/j.jtbi.2025.112249
URL: https://www.sciencedirect.com/science/article/pii/S0022519325002152
Impact factor: 2.0 (JCR Q2, Biology)
The bacterial microbiome is increasingly being recognised as a key factor in human health, driven in large part by datasets collected using 16S rRNA (ribosomal ribonucleic acid) gene sequencing, which enable cost-effective quantification of the composition of an individual’s bacterial community. One of the defining characteristics of 16S rRNA datasets is the set of evolutionary relationships that exist between taxa (phylogeny). Here, we demonstrate the utility of modelling these phylogenetic relationships in two statistical tasks (the two-sample test and host-trait prediction) and propose a novel family of kernels for analysing microbiome datasets by leveraging string kernels from the natural language processing literature. We show via simulation studies that a kernel two-sample test using the proposed kernel is sensitive to the phylogenetic scale of the difference between the two populations. In a second set of simulations, we also show how Gaussian process modelling with string kernels can infer the distribution of bacterial-host effects across the phylogenetic tree, and we apply this approach to a real host-trait prediction task. The results in the paper can be reproduced by running the code at https://github.com/jonathanishhorowicz/modelling_phylogeny_in_16srrna_using_string_kernels.
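To make the idea of a kernel two-sample test with a string kernel concrete, here is a minimal Python sketch. It is not the authors' implementation (their code is in the repository linked above). It assumes a simple k-mer spectrum kernel between taxon sequences, a sample-level kernel of the form X S Xᵀ built from relative abundances, and a permutation test on the biased squared maximum mean discrepancy (MMD). The sequences, abundances, and all function names are hypothetical, and every sequence is assumed to be at least k characters long.

```python
from collections import Counter

import numpy as np


def spectrum_features(seq, k=3):
    """Count every length-k substring (k-mer) of seq."""
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seqs, k=3):
    """Cosine-normalised k-mer spectrum kernel between sequences."""
    feats = [spectrum_features(s, k) for s in seqs]
    vocab = sorted(set().union(*feats))
    F = np.array([[f[w] for w in vocab] for f in feats], dtype=float)
    G = F @ F.T                      # inner products of k-mer count vectors
    d = np.sqrt(np.diag(G))
    return G / np.outer(d, d)        # unit diagonal


def sample_kernel(X, S):
    """Kernel between samples: abundance vectors compared through S."""
    return X @ S @ X.T


def mmd2(K, labels):
    """Biased estimate of squared MMD from a precomputed kernel matrix."""
    a, b = labels == 0, labels == 1
    return (K[np.ix_(a, a)].mean() + K[np.ix_(b, b)].mean()
            - 2.0 * K[np.ix_(a, b)].mean())


def permutation_test(K, labels, n_perm=500, seed=0):
    """Permutation p-value for the observed MMD between the two groups."""
    rng = np.random.default_rng(seed)
    observed = mmd2(K, labels)
    null = np.array([mmd2(K, rng.permutation(labels)) for _ in range(n_perm)])
    p_value = (1 + np.sum(null >= observed)) / (1 + n_perm)
    return observed, p_value


# Toy data (hypothetical): four short "taxa" forming two similar pairs.
taxa_seqs = ["ACGTACGT", "ACGTACGA", "TTGGCCAA", "TTGGCCAT"]
S = spectrum_kernel(taxa_seqs, k=3)

rng = np.random.default_rng(1)
# Group 0 is dominated by the first pair of taxa, group 1 by the second.
X = np.vstack([rng.dirichlet([5, 5, 0.5, 0.5], size=10),
               rng.dirichlet([0.5, 0.5, 5, 5], size=10)])
labels = np.array([0] * 10 + [1] * 10)

stat, p = permutation_test(sample_kernel(X, S), labels)
print(f"MMD^2 = {stat:.4f}, permutation p-value = {p:.4f}")
```

Because similar sequences get a high kernel value, shifting abundance between closely related taxa changes the sample kernel less than shifting it between distant ones. This is one simple way a test statistic can depend on the phylogenetic scale of a difference.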
About the journal:
The Journal of Theoretical Biology is the leading forum for theoretical perspectives that give insight into biological processes. It covers a very wide range of topics and is of interest to biologists in many areas of research, including:
• Brain and Neuroscience
• Cancer Growth and Treatment
• Cell Biology
• Developmental Biology
• Ecology
• Evolution
• Immunology
• Infectious and non-infectious Diseases
• Mathematical, Computational, Biophysical and Statistical Modeling
• Microbiology, Molecular Biology, and Biochemistry
• Networks and Complex Systems
• Physiology
• Pharmacodynamics
• Animal Behavior and Game Theory
Acceptable papers are those whose significance lies in the biology being presented rather than in the mathematical analysis. Papers that include data or experimental material bearing on theory will be considered, including those that contain comparative studies, statistical data analysis, mathematical proofs, computer simulations, experiments, field observations, or even philosophical arguments, all of which are methods to support or reject theoretical ideas. However, authors should make a concerted effort to keep papers intelligible to biologists in the chosen field.