XGBoost as a reliable machine learning tool for predicting ancestry using autosomal STR profiles - Proof of method.

Dejan Šorgić, Aleksandra Stefanović, Dušan Keckarević, Mladen Popović
Forensic Science International: Genetics, volume 76, article 103183. Published 2024-11-29.
DOI: 10.1016/j.fsigen.2024.103183 (https://doi.org/10.1016/j.fsigen.2024.103183)

Abstract

The aim of this study was to test the validity of a predictive model of ancestry affiliation based on Short Tandem Repeat (STR) profiles. Allele frequencies of 29 genetic markers, taken from the Promega website for four distinct population groups (African Americans, Asians, Caucasians, Hispanic Americans), were used to generate 360,000 profiles (90,000 per group). These profiles were then used to train and test a range of machine learning algorithms, with the goal of identifying the best-performing model for ancestry prediction. The chosen models (Decision Trees, Support Vector Machines, XGBoost, among others) were implemented in Python, and their performance was compared. The XGBoost model outperformed the others, with an accuracy of 94.24% across all four classes, 99.06% when differentiating among Asian, African American, and Caucasian subsamples, and 98.57% when differentiating among the African American group, the Asian group, and a mixed group combining Caucasians and Hispanics. Evaluating the impact of training set size showed that accuracy peaked at 94% with 90,000 profiles per category and fell to 83% when the number of profiles per category was reduced to 500, with precision in distinguishing between the Caucasian and Hispanic subgroups particularly affected. The study also investigated the impact of marker quantity on model accuracy: using 21 markers, commonly available in commercial amplification kits, gave an accuracy of 96.3% for African Americans, Asians, and Caucasians, and 88.28% for all four groups combined. These findings underscore the potential of STR-based models in forensic analysis and hint at the broader applicability of machine learning in genetic ancestry determination, with implications for enhancing the precision and reliability of forensic investigations, particularly in heterogeneous environments where ancestral background can be a crucial piece of information.
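
The profile-generation step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how profiles were drawn from the frequency tables, so this sketch assumes that the two alleles at each marker are sampled independently from the group's allele frequencies and that markers are independent of one another. The frequency values, the group names `group_A` and `group_B`, and the helper functions `simulate_profile` and `simulate_dataset` are hypothetical and chosen only for illustration; they are not the Promega data or the authors' code.

```python
import random

# Hypothetical allele frequencies for two markers in two groups.
# Illustrative numbers only; NOT the Promega data used in the study.
ALLELE_FREQS = {
    "group_A": {
        "D3S1358": {14: 0.10, 15: 0.30, 16: 0.35, 17: 0.25},
        "TH01": {6: 0.20, 7: 0.40, 9: 0.25, 9.3: 0.15},
    },
    "group_B": {
        "D3S1358": {14: 0.05, 15: 0.45, 16: 0.30, 17: 0.20},
        "TH01": {6: 0.10, 7: 0.25, 9: 0.50, 9.3: 0.15},
    },
}


def simulate_profile(marker_freqs, rng):
    """Draw two alleles per marker, independently, from the frequency table.

    Assumption: alleles within a marker and across markers are independent.
    Returns a flat list [m1_a1, m1_a2, m2_a1, m2_a2, ...] with each pair sorted.
    """
    profile = []
    for marker in sorted(marker_freqs):
        alleles = list(marker_freqs[marker])
        weights = list(marker_freqs[marker].values())
        pair = sorted(rng.choices(alleles, weights=weights, k=2))
        profile.extend(pair)
    return profile


def simulate_dataset(freqs, n_per_group, seed=0):
    """Build a balanced dataset: n_per_group simulated profiles per group."""
    rng = random.Random(seed)
    features, labels = [], []
    for label, group in enumerate(sorted(freqs)):
        for _ in range(n_per_group):
            features.append(simulate_profile(freqs[group], rng))
            labels.append(label)
    return features, labels


X, y = simulate_dataset(ALLELE_FREQS, n_per_group=1000)
print(len(X), len(X[0]), sorted(set(y)))  # 2000 4 [0, 1]
```

Each row holds two allele values per marker, and the integer label identifies the group the profile was drawn from. Rows in this shape could then be split into training and test sets and passed to a multi-class classifier such as `xgboost.XGBClassifier`; the feature encoding and hyperparameters used in the study are not given in the abstract.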
