Determining population structure from k-mer frequencies

Y. Hrytsenko, Noah M. Daniels, R. Schwartz
Proceedings of the 13th ACM International Conference on Bioinformatics, Computational Biology and Health Informatics, 2022-08-07
https://doi.org/10.1145/3535508.3545100
Determining population structure helps us understand how populations are connected and how they evolve over time. This knowledge is important for studies ranging from evolutionary biology to large-scale variant-trait association studies, such as genome-wide association studies (GWAS). Current approaches to determining population structure include model-based, statistical, and distance-based ancestry inference approaches. In this work, we outline an approach that identifies population structure from k-mer frequencies using principal component analysis (PCA). This approach can be classified as statistical; however, whereas prior work applied PCA to multilocus genotype data (SNPs, microsatellites, or haplotypes), here we analyze k-mer frequencies. K-mer frequencies can be viewed as a summary statistic of a genome and are easily derived by counting the number of times each k-mer occurs in a sequence. No genetic assumptions need to be met to generate k-mers. In contrast, current population differentiation approaches, such as STRUCTURE, depend on several genetic assumptions and require careful selection of ancestry-informative markers that can be used to identify populations. We show that PCA can detect population structure from k-mer counts alone. Applying PCA together with a clustering algorithm to the k-mer profiles of genomes provides an easy way to detect the number of populations (clusters) present in a dataset. We describe the method and show that its results are comparable to those of a model-based approach using genetic markers. We validate our method using 48 human genomes from populations identified by the 1000 Genomes Project. We also compare our results to those from Mash, which determines relationships among individuals using the number of matched k-mers between sequences, and we discuss the sensitivity of population structure identification for both methods. This study shows that PCA can detect population structure from k-mer frequencies and can separate samples of admixed and non-admixed origin, whereas Mash proved highly sensitive to the k-mer length and sketch size parameters.
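The two core steps described in the abstract, building a k-mer frequency profile per genome and projecting the profiles with PCA, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `kmer_frequencies` and `pc1_scores`, the choice of k = 2, and the four toy sequences are all assumptions made for illustration. To stay within the standard library, the sketch recovers only the first principal component by power iteration instead of calling a library PCA routine, and it omits the clustering step.

```python
from collections import Counter
from itertools import product
import math


def kmer_frequencies(seq, k):
    """Relative frequency of every DNA k-mer in seq, in lexicographic order.

    Windows containing characters other than A, C, G, T are ignored.
    """
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]


def pc1_scores(profiles, iterations=100):
    """Score of each sample on the first principal component.

    Uses power iteration on X^T X, where X is the column-centered
    profile matrix. The sign of the component is arbitrary.
    """
    n, d = len(profiles), len(profiles[0])
    means = [sum(row[j] for row in profiles) / n for j in range(d)]
    centered = [[row[j] - means[j] for j in range(d)] for row in profiles]

    # Start from the centered row with the largest norm. A uniform start
    # vector would fail here: frequency rows sum to 1, so every centered
    # row is orthogonal to the all-ones direction.
    v = max(centered, key=lambda row: sum(x * x for x in row))
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        return [0.0] * n  # all profiles are identical
    v = [x / norm for x in v]

    for _ in range(iterations):
        scores = [sum(x * w for x, w in zip(row, v)) for row in centered]
        w = [sum(scores[i] * centered[i][j] for i in range(n)) for j in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:
            break
        v = [x / norm for x in w]

    return [sum(x * w for x, w in zip(row, v)) for row in centered]


# Hypothetical toy data: two AT-rich and two GC-rich sequences.
samples = {
    "a1": "ATTATAATATTAGATATA",
    "a2": "TATAATTATACATATTAA",
    "b1": "GCGGCCGCGCAGCCGCGG",
    "b2": "CGCCGGCGCTGCGGCCGC",
}
profiles = [kmer_frequencies(s, 2) for s in samples.values()]
for name, score in zip(samples, pc1_scores(profiles)):
    print(name, round(score, 3))
```

In this toy setting the two AT-rich sequences fall on one side of the first component and the two GC-rich sequences on the other. Real genomes would need a larger k, a full PCA keeping several components, and a clustering step on the projected samples to estimate the number of populations.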