Popfinder: A Highly Effective Artificial Neural Network Package for Genetic Population Assignment

K Birchard, C Boccia, H Lounder, L Colston-Nepali, V L Friesen

Molecular Ecology Resources, article e14096. Published 2025-03-07. DOI: 10.1111/1755-0998.14096 (https://doi.org/10.1111/1755-0998.14096)
Abstract
The ability to assign biological samples to source populations with high accuracy and precision based on genetic variation is important for numerous applications from ecological studies through wildlife conservation to epidemiology. However, population assignment when genetic differentiation is low is challenging, and methods to address this problem are lacking. The application of artificial neural networks to population assignment using genomic data is highly promising. Here we present popfinder: a new, easy-to-use Python-based artificial neural network pipeline for genetic population assignment. We tested popfinder both with simulated genetic data from populations connected by varying levels of gene flow and with reduced-representation sequence data for three species of seabirds with weak to no population genetic structure. Popfinder was able to assign individuals to their source populations with high accuracy, precision and recall in most cases, including both simulated and empirical data sets, except in the empirical data set with the weakest population structure, where the comparator programs also performed poorly. Compared to other available software, popfinder was slower on the simulated data sets due to hyperparameter tuning and the fact that it does not reduce the dimensionality of the data set; however, all programs ran in seconds on empirical data sets. Additionally, popfinder provides a perturbation ranking method to help develop optimised SNP panels for genetic population assignment and is designed to be user-friendly. Finally, we caution users of all assignment programs to watch both for leakage of data during model training, which can lead to overfitting and inflation of performance metrics, and for unequal detection probabilities.
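The abstract does not say how popfinder's perturbation ranking is implemented, so the following is only a rough sketch of the general idea, not popfinder's API. It assumes a permutation-style perturbation: shuffle one SNP at a time in held-out samples and record how much assignment accuracy drops. A simple nearest-centroid classifier stands in for the neural network, the genotypes are simulated, and every function name here is hypothetical. The train/test split is made before the model is fitted, in line with the abstract's caution about data leakage.

```python
import numpy as np


def simulate_genotypes(rng, n_per_pop=200, n_snps=12, n_informative=2):
    """Simulate 0/1/2 genotypes for two populations (toy data, an assumption).

    Only the first `n_informative` SNPs differ in allele frequency.
    """
    freqs = np.full((2, n_snps), 0.5)
    freqs[0, :n_informative] = 0.1
    freqs[1, :n_informative] = 0.9
    X = np.vstack(
        [rng.binomial(2, freqs[p], size=(n_per_pop, n_snps)) for p in range(2)]
    )
    y = np.repeat([0, 1], n_per_pop)
    return X.astype(float), y


def fit_centroids(X, y):
    """Stand-in classifier: one mean genotype vector per population."""
    return np.stack([X[y == k].mean(axis=0) for k in np.unique(y)])


def predict(centroids, X):
    """Assign each sample to the population with the nearest centroid."""
    dist = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return dist.argmin(axis=1)


def perturbation_ranking(centroids, X_test, y_test, rng, n_repeats=10):
    """Rank SNPs by the mean accuracy drop when each one is shuffled."""
    base = (predict(centroids, X_test) == y_test).mean()
    drops = np.zeros(X_test.shape[1])
    for j in range(X_test.shape[1]):
        for _ in range(n_repeats):
            X_perturbed = X_test.copy()
            X_perturbed[:, j] = rng.permutation(X_perturbed[:, j])
            drops[j] += base - (predict(centroids, X_perturbed) == y_test).mean()
    drops /= n_repeats
    return np.argsort(-drops), drops, base


rng = np.random.default_rng(0)
X, y = simulate_genotypes(rng)

# Split first, then fit on the training samples only, so that no held-out
# sample influences the model (avoids data leakage).
idx = rng.permutation(len(y))
train, test = idx[:300], idx[300:]
centroids = fit_centroids(X[train], y[train])

ranking, drops, base_accuracy = perturbation_ranking(
    centroids, X[test], y[test], rng
)
print("held-out accuracy:", base_accuracy)
print("SNPs ranked by accuracy drop:", ranking.tolist())
```

In this toy setting the two simulated informative SNPs should rank first, which is the kind of signal one could use when shortlisting loci for a reduced SNP panel.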
About the journal:
Molecular Ecology Resources promotes the creation of comprehensive resources for the scientific community, encompassing computer programs, statistical and molecular advancements, and a diverse array of molecular tools. Serving as a conduit for disseminating these resources, the journal targets a broad audience of researchers in the fields of evolution, ecology, and conservation. Articles in Molecular Ecology Resources are crafted to support investigations tackling significant questions within these disciplines.
In addition to original resource articles, Molecular Ecology Resources features Reviews, Opinions, and Comments relevant to the field. The journal also periodically releases Special Issues focusing on resource development within specific areas.