AI tool adjusts for ancestral bias in genetic data
Iris Marchal
Nature Biotechnology, published 2025-04-14
DOI: https://doi.org/10.1038/s41587-025-02651-7
Human ancestry has a considerable impact on gene expression, but genomic datasets for disease analysis severely underrepresent non-European populations, thereby limiting the advancement of precision medicine. In a paper in Nature Communications, Smith et al. introduce a machine learning tool to mitigate the effects of ancestral bias in transcriptomic data.
The tool, called PhyloFrame, creates ancestry-aware signatures of disease by integrating population genomics data with smaller, disease-relevant training datasets. PhyloFrame first fits a logistic regression model with a LASSO penalty to obtain an initial set of disease-relevant genes. It then uses population genomics data to help compensate for data distribution shifts caused by human ancestry differences. In short, PhyloFrame projects the initial disease signature onto a functional interaction network and extends it to include the first and second neighbors of each signature gene. This expanded set is then filtered by a statistic called enhanced allele frequency (EAF), which captures population-specific allelic enrichment in healthy tissue, to identify ancestrally diverse genes that interact with the original signature. For each ancestry, a selected subset of genes with high EAF and gene expression variability in the training data is added to the PhyloFrame signature. Retraining the model with the forced inclusion of these equitable genes yields a disease signature that generalizes to all populations, including those not represented in the training data.
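The network-expansion and filtering steps described above can be illustrated with a minimal sketch. It is not PhyloFrame's actual code: the function names, the thresholds `eaf_min` and `var_min`, the `top_k` cutoff, and the toy gene network are all hypothetical. EAF scores are treated as precomputed inputs because the text does not give the formula, and reading "high" as applying to both EAF and expression variability is an assumption. The initial LASSO fit and the final retraining are omitted.

```python
from collections import deque


def neighbors_within(network, seeds, max_hops=2):
    """Genes reachable from any seed in at most max_hops edges (seeds excluded)."""
    seen = set(seeds)
    reached = set()
    frontier = deque((gene, 0) for gene in seeds)
    while frontier:
        gene, dist = frontier.popleft()
        if dist == max_hops:
            continue  # do not expand beyond second neighbors
        for nb in network.get(gene, ()):
            if nb not in seen:
                seen.add(nb)
                reached.add(nb)
                frontier.append((nb, dist + 1))
    return reached


def select_ancestry_genes(candidates, eaf, variance, eaf_min, var_min, top_k):
    """Per ancestry, keep the top_k candidates passing both (hypothetical) thresholds.

    eaf: {ancestry: {gene: precomputed EAF score}}
    variance: {gene: expression variance in the training data}
    """
    picked = {}
    for ancestry, scores in eaf.items():
        eligible = [
            g for g in candidates
            if scores.get(g, 0.0) >= eaf_min and variance.get(g, 0.0) >= var_min
        ]
        # Highest EAF first; gene name breaks ties so the result is deterministic.
        eligible.sort(key=lambda g: (-scores.get(g, 0.0), g))
        picked[ancestry] = eligible[:top_k]
    return picked


# Toy inputs (entirely made up for illustration).
network = {"A": ["B", "C"], "B": ["A", "D"], "C": ["A"], "D": ["B", "E"], "E": ["D"]}
signature = {"A"}  # stands in for the genes chosen by the initial LASSO model
eaf = {
    "anc1": {"B": 0.9, "C": 0.1, "D": 0.6},
    "anc2": {"B": 0.2, "C": 0.8, "D": 0.7},
}
variance = {"B": 1.5, "C": 0.05, "D": 2.0}

candidates = neighbors_within(network, signature, max_hops=2)
picked = select_ancestry_genes(candidates, eaf, variance,
                               eaf_min=0.5, var_min=0.1, top_k=1)
# Genes that a retrained model would be forced to include.
forced = signature.union(*picked.values())
```

In this toy run, gene E lies three hops from the signature and is never considered, while C is dropped for low expression variance despite a high EAF in one ancestry. The resulting `forced` set would then be passed to the retraining step, for example by exempting those genes from the penalty.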
About the journal:
Nature Biotechnology is a monthly journal that focuses on the science and business of biotechnology. It covers a wide range of topics including technology/methodology advancements in the biological, biomedical, agricultural, and environmental sciences. The journal also explores the commercial, political, ethical, legal, and societal aspects of this research.
The journal serves researchers by providing peer-reviewed research papers in the field of biotechnology. It also serves the business community by delivering news about research developments. This approach ensures that both the scientific and business communities are well-informed and able to stay up-to-date on the latest advancements and opportunities in the field.
Some key areas of interest in which the journal actively seeks research papers include molecular engineering of nucleic acids and proteins, molecular therapy, large-scale biology, computational biology, regenerative medicine, imaging technology, analytical biotechnology, applied immunology, food and agricultural biotechnology, and environmental biotechnology.
In summary, Nature Biotechnology is a comprehensive journal that covers both the scientific and business aspects of biotechnology. It strives to provide researchers with valuable research papers and news while also delivering important scientific advancements to the business community.