Latent generative modeling of long genetic sequences with GANs

bioRxiv - Genomics | Pub Date: 2024-08-07 | DOI: 10.1101/2024.08.07.607012

Antoine Szatkownik, Cyril Furtlehner, Guillaume Charpiat, Burak Yelmen, Flora Jay


Citations: 0

Abstract



Synthetic data generation via generative modeling has recently become a prominent research field in genomics, with applications ranging from functional sequence design to high-quality, privacy-preserving artificial in silico genomes. Following a body of work on Artificial Genomes (AGs) created via various generative models trained with raw genomic input, we propose a conceptually different approach to address the issues of scalability and complexity of genomic data generation in very high dimensions. Our method combines dimensionality reduction, achieved by Principal Component Analysis (PCA), and a Generative Adversarial Network (GAN) learning in this reduced space. Using this framework, we generated genomic proxy datasets for very diverse human populations around the world. We compared the quality of AGs generated by our approach with AGs generated by the established models and report improvements in capturing population structure, linkage disequilibrium, and metrics related to privacy leakage. Furthermore, we developed a frugal model with orders of magnitude fewer parameters and comparable performance to larger models. For quality assessment, we also implemented a new evaluation metric based on information theory to measure local haplotypic diversity, showing that generative models yield higher diversity than real genomes. In addition, we addressed the shrinkage issue associated with PCA and generative modeling, examined its relation to the nearest neighbor resemblance metric, and proposed a resolution. Finally, we evaluated the effect of different binarization methods on the quality of the output AGs.
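The abstract mentions an information-theoretic metric for local haplotypic diversity but does not define it. As a rough illustration only, the sketch below assumes the metric resembles the Shannon entropy of haplotype patterns within fixed-width windows; the function names (`window_entropy`, `local_diversity`), the window scheme, and the toy data are all hypothetical and are not taken from the paper.

```python
from collections import Counter
from math import log2


def window_entropy(haplotypes, start, width):
    """Shannon entropy (bits) of the haplotype patterns seen in one window.

    ASSUMPTION: entropy of pattern frequencies is used as the diversity
    score; the paper's actual metric may differ.
    """
    patterns = Counter(tuple(h[start:start + width]) for h in haplotypes)
    total = sum(patterns.values())
    # p * log2(1/p), written with total/count so a single pattern gives 0.0
    return sum((c / total) * log2(total / c) for c in patterns.values())


def local_diversity(haplotypes, width, step=None):
    """Entropy per window, moving along the sequence `step` sites at a time."""
    step = step or width  # default: non-overlapping windows
    n_sites = len(haplotypes[0])
    return [
        window_entropy(haplotypes, s, width)
        for s in range(0, n_sites - width + 1, step)
    ]


# Toy data (made up): 4 haplotypes over 4 biallelic sites, coded 0/1.
real = [
    [0, 0, 1, 1],
    [0, 0, 1, 1],
    [0, 0, 0, 0],
    [0, 0, 0, 0],
]
synthetic = [
    [0, 0, 1, 1],
    [0, 1, 1, 0],
    [1, 0, 0, 1],
    [1, 1, 0, 0],
]

real_div = local_diversity(real, width=2)
synth_div = local_diversity(synthetic, width=2)
print("real:", real_div)
print("synthetic:", synth_div)
```

In this toy case every window of the synthetic set contains more distinct patterns than the matching window of the real set, so its per-window entropy is higher. That is the kind of comparison the abstract describes when it reports that generative models yield higher diversity than real genomes.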
