Short Tandem Repeat Analysis as a Novel Method for Biogeographic Ancestry Prediction

Clarissa R. Jolley, Hannah J. Lee, Kristen A. Lucas, William P. McDevitt
Published in: 2022 Systems and Information Engineering Design Symposium (SIEDS), 2022-04-28
DOI: https://doi.org/10.1109/sieds55548.2022.9799365

Abstract

Determining an individual's biogeographic ancestry from DNA remains a major task in forensic laboratories worldwide. Because full-scale genomic data acquisition and processing are costly, many forensic laboratories cannot conduct comprehensive genetic testing based on ancestry-informative single nucleotide polymorphisms (aiSNPs), which creates a need for more cost-effective sources of information. In the present study, we assessed machine learning (ML) approaches to the analysis of short tandem repeats (STRs), non-coding repeats of a short DNA sequence, for determining biogeographic ancestry. STRs are sequences in which a unit of 1 to 25 nucleotides is repeated in tandem, and they occur at various locations across the genome. Because STRs are highly variable, they are widely used to create unique genetic profiles of individuals. Using anonymized STR data provided by the US Department of Defense (DoD) and collected from $\mathrm{N}=1747$ subjects across $\mathrm{K}=5$ continents, we analyzed the performance of selected loci in random forest classification models that predict the continental origin of each individual from their genetic data. Supervised classification test accuracy varied from $\sim45\%$ to $>60\%$, while 10-fold training accuracy varied from $60\%$ to $\sim80\%$ across the profiles surveyed. Unsupervised clustering test accuracy was $\sim35\%$. Our findings indicate that STR data show promise as a novel basis for continental ancestry prediction and that higher accuracy may be reachable with further research. We conclude with comments on future strategies for parameter optimization to maximize the utility of STR analysis, which may benefit smaller laboratories and expedite biogeographic ancestry prediction for forensic professionals and law enforcement officials.
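
As a rough illustration of the kind of input such models take, the sketch below counts tandem copies of a repeat unit in a sequence and flattens per-locus allele pairs into a fixed-order feature vector. The function names, the locus labels `LOCUS_A` and `LOCUS_B`, and the encoding (two sorted repeat counts per locus) are assumptions made for illustration; the abstract does not describe the study's actual preprocessing.

```python
def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`."""
    if not motif:
        raise ValueError("motif must be non-empty")
    k = len(motif)
    best = 0
    # Try every start position so that runs in any phase are found.
    for start in range(len(sequence) - k + 1):
        count = 0
        pos = start
        while sequence.startswith(motif, pos):
            count += 1
            pos += k
        best = max(best, count)
    return best


def encode_profile(genotypes: dict, loci: list) -> list:
    """Flatten {locus: (allele_1, allele_2)} into a fixed-order feature vector.

    Alleles are repeat counts. Each pair is sorted so the vector does not
    depend on the order in which the two alleles were recorded.
    (Hypothetical encoding, not taken from the paper.)
    """
    features = []
    for locus in loci:
        low, high = sorted(genotypes[locus])
        features.extend([low, high])
    return features


if __name__ == "__main__":
    # Synthetic sequence: 3 copies of AATG, a spacer base, then 5 copies.
    seq = "AATG" * 3 + "C" + "AATG" * 5
    print(longest_tandem_run(seq, "AATG"))

    # Hypothetical two-locus profile; locus names are placeholders.
    profile = {"LOCUS_A": (9, 7), "LOCUS_B": (12, 12)}
    print(encode_profile(profile, ["LOCUS_A", "LOCUS_B"]))
```

A table of such vectors, one row per subject with a continent label, is the sort of input a random forest classifier could be trained on; how the study itself encoded its STR profiles is not stated in the abstract.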