Gene-specific pathogenicity predictor for chromatin remodeling BAF complex-associated neurodevelopmental disorders.

IF 3.6 Q2 GENETICS & HEREDITY

HGG Advances Pub Date: 2026-04-09 Epub Date: 2026-02-28 DOI: 10.1016/j.xhgg.2026.100583

Joshua Hack, Mohammad Nazim



Abstract

Advances in whole-genome sequencing have increased the number of variants of uncertain significance (VUS) identified in human genomes. This has created a diagnostic bottleneck for genetic counselors, who must sift through these variants and determine which are most likely to be causative for a patient's clinical presentation. Machine learning (ML) tools can aid in identifying pathogenic variants among VUS, but there is a need for gene-specific algorithms that predict pathogenic variants with high accuracy. To address this need, we present a workflow for developing gene-specific, ensemble-learning ML tools that leverage outputs from other algorithms, locations of variants within the gene, and evolutionary conservation data to predict pathogenicity. Variants in SMARCA2 and SMARCA4 that are associated with rare neurodevelopmental diseases were used to screen 15 ML algorithms. A random forest learner was tuned to yield a final accuracy of 0.93 on holdout data. Generalizing this predictor to other BRG1/BRM-associated factor (BAF) complex proteins resulted in a sharp decline in performance. We then trained a final predictor on all genes in the study; it identifies pathogenic variants in these BAF subunits with an accuracy of 0.91 on holdout data. This predictor, specific to BAF complex proteins, performs with higher accuracy and area under the precision-recall curve than any other predictor. The decline in performance when generalized to other proteins emphasizes the need for gene-specific calibration of predictors. Our workflow for the development of such models provides a quick, computationally inexpensive route for improving the ML tools available to genetic counselors.
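The two metrics the abstract reports, holdout accuracy and area under the precision-recall curve, can be illustrated with a minimal sketch. This is not the authors' pipeline: the function names, the 0.5 decision threshold, and the toy labels and scores below are assumptions made up for illustration, and the area is computed as step-wise average precision, which is one common estimate of it.

```python
from typing import List


def accuracy(y_true: List[int], y_pred: List[int]) -> float:
    """Fraction of variants whose predicted label matches the true label."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def average_precision(y_true: List[int], scores: List[float]) -> float:
    """Step-wise estimate of the area under the precision-recall curve.

    Assumes labels are 0/1 (1 = pathogenic) and that scores are not tied;
    tied scores would need to be grouped into a single threshold.
    """
    n_pos = sum(y_true)
    if n_pos == 0:
        raise ValueError("at least one positive label is required")
    # Rank variants from highest to lowest predicted score.
    ranked = sorted(zip(scores, y_true), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    total = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            # Precision at the rank where this positive is recovered.
            total += true_positives / rank
    return total / n_pos


# Hypothetical holdout set: 1 = pathogenic, 0 = benign (toy values only).
labels = [1, 0, 1, 1, 0, 0]
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
predictions = [1 if s >= 0.5 else 0 for s in scores]  # assumed 0.5 threshold

holdout_accuracy = accuracy(labels, predictions)
holdout_auprc = average_precision(labels, scores)
print(f"accuracy={holdout_accuracy:.3f}  AUPRC={holdout_auprc:.3f}")
```

Accuracy depends on the chosen decision threshold, whereas the precision-recall area summarizes ranking quality across all thresholds, which is why the two are often reported together when pathogenic and benign variants are imbalanced.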


