Clarissa R. Jolley, Hannah J. Lee, Kristen A. Lucas, William P. McDevitt
2022 Systems and Information Engineering Design Symposium (SIEDS). Published 2022-04-28. DOI: 10.1109/sieds55548.2022.9799365 (https://doi.org/10.1109/sieds55548.2022.9799365)
Short Tandem Repeat Analysis as a Novel Method for Biogeographic Ancestry Prediction
Determining an individual's biogeographic ancestry from DNA remains a major task in forensic laboratories worldwide. Because full-scale genomic data acquisition and processing are costly, many forensic laboratories cannot conduct comprehensive genetic testing based on ancestry-informative single nucleotide polymorphisms (aiSNPs), which creates a need for more cost-effective sources of information. In the present study, we assessed machine learning (ML) approaches to the analysis of short tandem repeats (STRs), non-coding repeats of a short DNA sequence, for determining biogeographic ancestry. STRs are repeat sequences in which a unit of 1 to 25 nucleotides occurs at various locations across the genome. Because STRs are highly variable, they are widely used to create unique genetic profiles of individuals. We analyzed the performance of selected loci in random forest classification models using anonymized STR data, provided by the US Department of Defense (DoD) and collected from $\mathrm{N}=1747$ subjects across $\mathrm{K}=5$ continents, to predict each individual's continental origin from their genetic profile. Supervised classification test accuracy varied from $\sim45\%$ to $>60\%$, while 10-fold training accuracy varied from $60\%$ to $\sim80\%$ across the profiles surveyed. Unsupervised clustering test accuracy was $\sim35\%$. Our findings indicate that STR data show promise as a novel basis for continental ancestry prediction and that, with further research, higher accuracy may be reached. We conclude with comments on future strategies for parameter optimization to maximize the utility of STR analysis, which may benefit smaller laboratories and expedite biogeographic ancestry determination for forensic professionals and law enforcement officials.
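To make the notion of an STR allele concrete, the following minimal Python sketch counts how many times a repeat unit occurs back to back in a sequence and flattens a per-locus genotype into a numeric feature vector of the kind a classifier could consume. It is an illustration under stated assumptions, not the pipeline used in the study: the function names `longest_tandem_run` and `profile_to_features`, the locus labels `LOCUS_A` and `LOCUS_B`, and the toy sequence are all hypothetical, and the abstract does not say how the STR profiles were actually encoded.

```python
from typing import Dict, List, Sequence, Tuple


def longest_tandem_run(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of `motif` in `sequence`.

    Every start position is tried, so runs that overlap an earlier,
    shorter run are still found.
    """
    if not motif:
        raise ValueError("motif must be non-empty")
    best = 0
    step = len(motif)
    for start in range(len(sequence)):
        count = 0
        pos = start
        # Walk forward one motif length at a time while the motif keeps matching.
        while sequence.startswith(motif, pos):
            count += 1
            pos += step
        best = max(best, count)
    return best


def profile_to_features(
    profile: Dict[str, Tuple[int, int]], loci: Sequence[str]
) -> List[int]:
    """Flatten a genotype (two repeat counts per locus) into one feature vector.

    Alleles are sorted within each locus so the encoding does not depend on
    the order in which the two alleles were recorded. This encoding is an
    assumption made for illustration only.
    """
    features: List[int] = []
    for locus in loci:
        low, high = sorted(profile[locus])
        features.extend([low, high])
    return features


if __name__ == "__main__":
    # Toy sequence: flanking bases around seven copies of a hypothetical GATA unit.
    toy_sequence = "CCT" + "GATA" * 7 + "GGC"
    print(longest_tandem_run(toy_sequence, "GATA"))

    # Hypothetical two-locus genotype expressed as repeat counts.
    toy_profile = {"LOCUS_A": (9, 7), "LOCUS_B": (12, 12)}
    print(profile_to_features(toy_profile, ["LOCUS_A", "LOCUS_B"]))
```

Vectors of this form, one per subject with a continental label attached, are the sort of input a random forest classifier would take. Fixing the locus order in `loci` keeps the columns aligned across subjects.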