Low-dimensional genotype embeddings for predictive models

Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Pub Date: 2022-08-07. DOI: 10.1145/3535508.3545507

Syed Fahad Sultan, Xingzhi Guo, S. Skiena


Citations: 0

Abstract

We develop methods for constructing low-dimensional vector representations (embeddings) of large-scale genotyping data, capable of reducing genotypes of hundreds of thousands of SNPs to 100-dimensional embeddings that retain substantial predictive power for inferring medical phenotypes. We demonstrate that embedding-based models yield an average F-score of 0.605 on a test of ten phenotypes (including BMI prediction, genetic relatedness, and depression) versus 0.339 for baseline models. Genotype embeddings also hold promise for creating shareable data while preserving subject anonymity: we show that they retain substantial predictive power even after anonymization by adding Gaussian noise to each dimension.
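The abstract does not say how the embeddings are built, so the following is only a minimal sketch of the two ideas it names: reducing a genotype matrix to 100 dimensions, then anonymizing by adding Gaussian noise to each dimension. PCA via SVD is used as a stand-in and is not claimed to be the authors' method. The function names, the synthetic data, and the `scale` parameter are all hypothetical.

```python
import numpy as np


def embed_genotypes(genotypes: np.ndarray, dim: int = 100) -> np.ndarray:
    """Project a (subjects x SNPs) matrix onto its top `dim` principal components.

    Assumption: PCA is a stand-in here, not the embedding method of the paper.
    """
    # Center each SNP column so the SVD yields principal components.
    centered = genotypes - genotypes.mean(axis=0, keepdims=True)
    # Thin SVD: u is (n, k) and s is (k,), with k = min(n_subjects, n_snps).
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    k = min(dim, s.shape[0])
    # Subject coordinates in the reduced space.
    return u[:, :k] * s[:k]


def add_gaussian_noise(
    embeddings: np.ndarray, scale: float, rng: np.random.Generator
) -> np.ndarray:
    """Add independent zero-mean Gaussian noise to every embedding dimension.

    Assumption: the noise standard deviation of each dimension is `scale`
    times that dimension's own standard deviation.
    """
    per_dim_std = embeddings.std(axis=0, keepdims=True)
    noise = rng.normal(0.0, 1.0, size=embeddings.shape) * (scale * per_dim_std)
    return embeddings + noise


rng = np.random.default_rng(0)
# Synthetic genotypes: 200 subjects x 1,000 SNPs, allele counts in {0, 1, 2}.
genotypes = rng.integers(0, 3, size=(200, 1000)).astype(np.float64)

embeddings = embed_genotypes(genotypes, dim=100)
noisy = add_gaussian_noise(embeddings, scale=0.5, rng=rng)
print(embeddings.shape, noisy.shape)
```

In a real pipeline, a phenotype classifier would be trained on `embeddings` and on `noisy`, and the two F-scores compared to see how much predictive power survives the noise. The noise level needed for a given anonymity guarantee is not stated in the abstract and is left open here.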
