HAPNEST: efficient, large-scale generation and evaluation of synthetic datasets for genotypes and phenotypes.

IF 4.4 | Zone 3 (Biology) | JCR Q1 (Biochemical Research Methods)
Sophie Wharrie, Zhiyu Yang, Vishnu Raj, Remo Monti, Rahul Gupta, Ying Wang, Alicia Martin, Luke J O'Connor, Samuel Kaski, Pekka Marttinen, Pier Francesco Palamara, Christoph Lippert, Andrea Ganna
Bioinformatics 39(9), published 2023-09-02. DOI: https://doi.org/10.1093/bioinformatics/btad535. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10493177/pdf/
Citations: 0

Abstract


Motivation: Existing methods for simulating synthetic genotype and phenotype datasets have limited scalability, constraining their usability for large-scale analyses. Moreover, a systematic approach for evaluating synthetic data quality and a benchmark synthetic dataset for developing and evaluating methods for polygenic risk scores are lacking.

Results: We present HAPNEST, a novel approach for efficiently generating diverse individual-level genotypic and phenotypic data. Compared with alternative methods, HAPNEST runs faster and produces data with a lower degree of relatedness to the reference panels, while preserving key statistical properties of real data. These properties enabled us to generate 6.8 million common variants and nine phenotypes with varying degrees of heritability and polygenicity across 1 million individuals. We demonstrate how HAPNEST can facilitate biobank-scale analyses by comparing seven methods for polygenic risk scoring across multiple ancestry groups and different genetic architectures.

Availability and implementation: A synthetic dataset of 1 008 000 individuals and nine traits for 6.8 million common variants is available at https://www.ebi.ac.uk/biostudies/studies/S-BSST936. The HAPNEST software for generating synthetic datasets is available as Docker/Singularity containers and open source Julia and C code at https://github.com/intervene-EU-H2020/synthetic_data.
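
The Results refer to phenotypes with varying degrees of heritability and polygenicity. As a rough illustration of what those two parameters control, the following is a minimal sketch of a generic additive phenotype simulation. It is not HAPNEST's algorithm or API: the function name `simulate_phenotype`, its parameters, and the toy random genotypes are all assumptions made for this example.

```python
import random
import statistics


def simulate_phenotype(genotypes, h2, polygenicity, seed=0):
    """Simulate one quantitative trait (hypothetical helper, not HAPNEST code).

    genotypes: list of per-individual lists of allele counts (0, 1 or 2).
    h2: target heritability, strictly between 0 and 1.
    polygenicity: fraction of variants given a non-zero effect.
    """
    rng = random.Random(seed)
    n_variants = len(genotypes[0])

    # Pick the causal variants; every other variant has effect size 0.
    n_causal = max(1, round(polygenicity * n_variants))
    causal = rng.sample(range(n_variants), n_causal)
    effects = {j: rng.gauss(0.0, 1.0) for j in causal}

    # Genetic component: weighted sum of allele counts at causal variants.
    genetic = [sum(beta * row[j] for j, beta in effects.items())
               for row in genotypes]

    mean_g = statistics.fmean(genetic)
    sd_g = statistics.pstdev(genetic)
    if sd_g == 0:
        raise ValueError("causal variants show no variation in this sample")

    # Rescale so the genetic component has variance h2 across the sample.
    genetic = [(g - mean_g) / sd_g * h2 ** 0.5 for g in genetic]

    # Independent environmental noise with variance 1 - h2.
    noise_sd = (1.0 - h2) ** 0.5
    phenotype = [g + rng.gauss(0.0, noise_sd) for g in genetic]
    return phenotype, genetic, sorted(causal)


# Toy input: 2000 individuals, 50 variants, allele counts drawn at random.
toy_rng = random.Random(1)
toy_genotypes = [[toy_rng.choice((0, 1, 2)) for _ in range(50)]
                 for _ in range(2000)]
phenotype, genetic, causal = simulate_phenotype(
    toy_genotypes, h2=0.5, polygenicity=0.1)
```

In this sketch, raising `polygenicity` spreads the same genetic variance over more variants, while `h2` fixes the share of trait variance that the genetic component explains. Real simulators also account for allele frequencies, linkage disequilibrium and ancestry, which are omitted here.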

Source journal: Bioinformatics
Category: Biology, Biochemical Research Methods
CiteScore: 11.20
Self-citation rate: 5.20%
Articles published: 753
Review time: 2.1 months
Journal description: The leading journal in its field, Bioinformatics publishes the highest quality scientific papers and review articles of interest to academic and industrial researchers. Its main focus is on new developments in genome bioinformatics and computational biology. Two distinct sections within the journal, Discovery Notes and Application Notes, focus on shorter papers; the former reports biologically interesting discoveries made using computational methods, the latter explores the applications used for experiments.