Using simulated microhaplotype genotyping data to evaluate the value of machine learning algorithms for inferring DNA mixture contributor numbers

IF 3.2 | Zone 2 (Medicine) | Q2 GENETICS & HEREDITY

Forensic Science International-Genetics Pub Date : 2024-01-09 DOI:10.1016/j.fsigen.2024.103008

Haoyu Wang, Qiang Zhu, Yuguo Huang, Yueyan Cao, Yuhan Hu, Yifan Wei, Yuting Wang, Tingyun Hou, Tiantian Shan, Xuan Dai, Xiaokang Zhang, Yufang Wang, Ji Zhang


Citations: 0

Abstract

Inferring the number of contributors (NoC) is a crucial step in interpreting DNA mixtures, as it directly affects the accuracy of the likelihood ratio calculation and the assessment of evidence strength. However, obtaining the correct NoC in complex DNA mixtures remains challenging because of the high degree of allele sharing and dropout. This study aimed to analyze the impact of allele sharing and dropout on NoC inference in complex DNA mixtures when using microhaplotypes (MH). The effectiveness and value of highly polymorphic MH for NoC inference in complex DNA mixtures were evaluated by comparing the performance of three NoC inference methods: the maximum allele count (MAC) method, the maximum likelihood estimation (MLE) method, and the random forest classification (RFC) algorithm. We selected the 100 most polymorphic MH from the Southern Han Chinese (CHS) population and simulated over 40 million complex DNA mixture profiles with NoC ranging from 2 to 8. These profiles involved unrelated individuals (RM type) and related pairs of individuals, including parent-offspring pairs (PO type), full-sibling pairs (FS type), and second-degree kinship pairs (SE type). Our results showed how the number of detected alleles in DNA mixture profiles varied with marker polymorphism, the involvement of kinship, NoC, and dropout settings. Across the different types of DNA mixtures, the MAC and MLE methods performed best in the RM type, followed by the SE, FS, and PO types, while the RFC models performed best in the PO type, followed by the RM, SE, and FS types. The recall of all three methods for NoC inference decreased as the NoC and dropout levels increased. Furthermore, the MLE method performed better at low NoC, whereas the RFC models excelled at high NoC and/or high dropout levels, regardless of the availability of a priori information about related pairs of individuals in DNA mixtures. However, the RFC models that considered this a priori information and were trained specifically on each type of DNA mixture profile outperformed the RFC_ALL model, which did not consider such information. Finally, we provided recommendations for model building when applying machine learning algorithms to NoC inference.
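As a rough illustration of why allele sharing makes NoC inference hard, the MAC method can be sketched as follows. This is a minimal sketch, not the authors' implementation: it assumes diploid contributors, no dropout or drop-in, and the common formulation in which the estimate is the largest per-locus allele count divided by two, rounded up. The function names (`mac_noc`, `mix`), the locus names, and the genotypes are all hypothetical.

```python
import math


def mix(genotypes):
    """Pool the alleles of all contributors at each locus (assumes no dropout)."""
    profile = {}
    for genotype in genotypes:
        for locus, alleles in genotype.items():
            profile.setdefault(locus, set()).update(alleles)
    return profile


def mac_noc(profile):
    """Minimum number of diploid contributors implied by the maximum allele count.

    profile maps a locus name to the set of alleles detected at that locus.
    """
    max_alleles = max((len(alleles) for alleles in profile.values()), default=0)
    return math.ceil(max_alleles / 2)


# Hypothetical genotypes at two made-up microhaplotype loci.
person_a = {"MH01": ("A1", "A2"), "MH02": ("B1", "B1")}
person_b = {"MH01": ("A3", "A4"), "MH02": ("B1", "B2")}
person_c = {"MH01": ("A1", "A3"), "MH02": ("B2", "B3")}

profile = mix([person_a, person_b, person_c])
estimate = mac_noc(profile)
print(f"true NoC = 3, MAC estimate = {estimate}")
```

In this toy profile the third contributor shares every MH01 allele with the other two, so at most four distinct alleles are seen at any locus and MAC reports only two contributors. MAC is a lower bound on the NoC, which is consistent with the abstract's observation that performance degrades as allele sharing, NoC, and dropout increase.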




Source journal

Forensic Science International-Genetics (Biology - Medicine: Legal)

CiteScore: 7.50

Self-citation rate: 32.30%

Articles published: 132

Review time: 11.3 weeks

Journal description: Forensic Science International: Genetics is the premier journal in the field of Forensic Genetics. This branch of Forensic Science can be defined as the application of genetics to human and non-human material (in the sense of a science with the purpose of studying inherited characteristics for the analysis of inter- and intra-specific variations in populations) for the resolution of legal conflicts. The scope of the journal includes: Forensic applications of human polymorphism. Testing of paternity and other family relationships, immigration cases, typing of biological stains and tissues from criminal casework, identification of human remains by DNA testing methodologies. Description of human polymorphisms of forensic interest, with special interest in DNA polymorphisms. Autosomal DNA polymorphisms, mini- and microsatellites (or short tandem repeats, STRs), single nucleotide polymorphisms (SNPs), X and Y chromosome polymorphisms, mtDNA polymorphisms, and any other type of DNA variation with potential forensic applications. Non-human DNA polymorphisms for crime scene investigation. Population genetics of human polymorphisms of forensic interest. Population data, especially from DNA polymorphisms of interest for the solution of forensic problems. DNA typing methodologies and strategies. Biostatistical methods in forensic genetics. Evaluation of DNA evidence in forensic problems (such as paternity or immigration cases, criminal casework, identification), classical and new statistical approaches. Standards in forensic genetics. Recommendations of regulatory bodies concerning methods, markers, interpretation or strategies or proposals for procedural or technical standards. Quality control. Quality control and quality assurance strategies, proficiency testing for DNA typing methodologies. Criminal DNA databases. Technical, legal and statistical issues. General ethical and legal issues related to forensic genetics.