Determining population structure from k-mer frequencies

Y. Hrytsenko, Noah M. Daniels, R. Schwartz
Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics. Published 2022-08-07. DOI: https://doi.org/10.1145/3535508.3545100
Citations: 0

Abstract

Determining population structure helps us understand how populations are connected and how they evolve over time. This knowledge matters for studies ranging from evolutionary biology to large-scale variant-trait association studies, such as genome-wide association studies (GWAS). Current approaches to determining population structure include model-based, statistical, and distance-based ancestry inference approaches. In this work, we outline an approach that identifies population structure from k-mer frequencies using principal component analysis (PCA). The approach can be classified as statistical; however, whereas prior work has applied PCA to multilocus genotype data (SNPs, microsatellites, or haplotypes), here we analyze k-mer frequencies. K-mer frequencies can be viewed as a summary statistic of a genome and are easily derived by counting the number of times each k-mer occurs in a sequence. No genetic assumptions need to be met to generate k-mers. By contrast, current population differentiation approaches, such as STRUCTURE, depend on several genetic assumptions and require careful selection of ancestry-informative markers that can be used to identify populations. We show that PCA can detect population structure from k-mer counts alone. Applying PCA together with a clustering algorithm to the k-mer profiles of genomes provides an easy way to detect the number of populations (clusters) present in a dataset. We describe the method and show that its results are comparable to those of a model-based approach using genetic markers. We validate the method using 48 human genomes from populations identified by the 1000 Genomes Project. We also compare our results with those from Mash, which determines relationships among individuals using the number of matched k-mers between sequences, and discuss how sensitive each method is in identifying population structure. This study shows that PCA can detect population structure from k-mer frequencies and can separate samples of admixed and non-admixed origin, whereas Mash proved highly sensitive to the k-mer length and sketch size parameters.
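The core idea of the abstract, counting k-mers per genome and running PCA on the resulting profiles, can be illustrated with a minimal sketch. The abstract does not state the value of k, the normalization, or the clustering algorithm used, so everything below is an assumption made for illustration: the toy sequences are invented, k is set to 2, counts are converted to relative frequencies, and PCA is computed with a plain SVD in NumPy. The function names `kmer_profile` and `pca_scores` are hypothetical and do not come from the paper.

```python
from collections import Counter
from itertools import product

import numpy as np


def kmer_profile(seq, k):
    """Relative frequency of every possible DNA k-mer in seq, in a fixed order."""
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return np.array([counts[m] / total for m in kmers])


def pca_scores(profiles, n_components=2):
    """Project samples (rows) onto the top principal components via SVD."""
    centered = profiles - profiles.mean(axis=0)
    u, s, _ = np.linalg.svd(centered, full_matrices=False)
    return u[:, :n_components] * s[:n_components]


# Invented toy "genomes": two AT-rich and two GC-rich sequences (assumption).
samples = {
    "a1": "ATATTAATATTATAATATAT",
    "a2": "TATAATTATATTAATATATA",
    "g1": "GCGCCGGCGCGGCCGCGCGC",
    "g2": "CGCGGCCGCGCCGGCGCGCG",
}

# One row per sample, one column per 2-mer (16 columns for k=2).
profiles = np.array([kmer_profile(s, k=2) for s in samples.values()])
scores = pca_scores(profiles)

for name, (pc1, pc2) in zip(samples, scores):
    print(f"{name}: PC1={pc1:+.3f} PC2={pc2:+.3f}")
```

In this toy case the two AT-rich samples fall on one side of the first principal component and the two GC-rich samples on the other. A clustering algorithm could then be run on `scores` to estimate the number of groups, which is the step the abstract describes but does not specify.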