Prioritizing genomic variants pathogenicity via DNA, RNA, and protein-level features based on extreme gradient boosting

IF 3.8 · CAS Zone 2 (Biology) · Q2 GENETICS & HEREDITY

Human Genetics · Pub Date: 2024-04-04 · DOI: 10.1007/s00439-024-02667-0

Maolin Ding, Ken Chen, Yuedong Yang, Huiying Zhao


Citations: 0

Abstract

Most genetic diseases are associated with genetic variants, including missense, synonymous, nonsense, and copy number variants. Previous studies indicate that these different kinds of variants affect phenotypes in different ways. Understanding the functional consequences of these variants, especially the noncoding ones, remains essential but challenging because corresponding annotations are lacking. Many computational methods have been proposed to identify risk variants, but most of them use only DNA-level and protein-level annotations to predict pathogenicity, and others are restricted to missense variants. In this study, we curated DNA-, RNA-, and protein-level features to discriminate disease-causing variants in both coding and noncoding regions. Features of protein sequence and protein structure proved essential for analyzing missense variants in coding regions, while features related to RNA splicing and RNA-binding protein (RBP) binding were significant for variants in noncoding regions and for synonymous variants in coding regions. By integrating these features, we developed the Multi-level feature Genomic Variants Predictor (ML-GVP) using gradient boosting trees. The method was trained on more than 400,000 variants in the Sherloc training set from the 6th Critical Assessment of Genome Interpretation (CAGI6), where it showed superior performance. It was one of the two best-performing predictors on the blind test in the Sherloc assessment, and its performance was further confirmed on an independent test dataset of de novo variants.
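The core idea of the abstract, scoring variants with boosted trees over features drawn from several biological levels, can be illustrated with a minimal sketch. This is not the ML-GVP implementation: the three feature names (`conservation`, `splice_disruption`, `protein_stability`), the synthetic labels, and all hyperparameters are hypothetical, and the code uses plain first-order gradient boosting with decision stumps in pure Python rather than extreme gradient boosting, so that it runs without external libraries.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Pick the one-feature threshold split that best fits the residuals (least squares)."""
    best = None
    for j in range(len(X[0])):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            thr = (lo + hi) / 2.0
            left = [r for row, r in zip(X, residuals) if row[j] <= thr]
            right = [r for row, r in zip(X, residuals) if row[j] > thr]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or sse < best[0]:
                best = (sse, j, thr, lm, rm)
    return best[1:]  # (feature index, threshold, left value, right value)


def fit_boosting(X, y, n_rounds=20, learning_rate=0.3):
    """Gradient boosting for binary labels with logistic loss."""
    prevalence = sum(y) / len(y)
    base = math.log(prevalence / (1.0 - prevalence))  # initial log-odds
    scores = [base] * len(X)
    stumps = []
    for _ in range(n_rounds):
        # Negative gradient of the log loss with respect to each score.
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, thr, lm, rm = fit_stump(X, residuals)
        stumps.append((j, thr, lm, rm))
        scores = [s + learning_rate * (lm if row[j] <= thr else rm)
                  for row, s in zip(X, scores)]
    return base, learning_rate, stumps


def predict_proba(model, row):
    base, lr, stumps = model
    score = base + sum(lr * (lm if row[j] <= thr else rm) for j, thr, lm, rm in stumps)
    return sigmoid(score)


def log_loss(y, probs):
    return -sum(yi * math.log(p) + (1 - yi) * math.log(1 - p)
                for yi, p in zip(y, probs)) / len(y)


# Hypothetical features, one per level: DNA, RNA, protein.
FEATURES = ["conservation", "splice_disruption", "protein_stability"]


def make_variant(rng):
    """Synthetic variant: label is 1 (pathogenic) when the noisy feature sum is high."""
    feats = [rng.random() for _ in FEATURES]
    label = 1 if sum(feats) + rng.gauss(0, 0.2) > 1.5 else 0
    return feats, label


rng = random.Random(0)
data = [make_variant(rng) for _ in range(150)]
train_X, train_y = [f for f, _ in data[:100]], [l for _, l in data[:100]]
test_X = [f for f, _ in data[100:]]

model = fit_boosting(train_X, train_y)

# Prioritize held-out variants by predicted pathogenicity, highest first.
ranked = sorted(test_X, key=lambda row: predict_proba(model, row), reverse=True)
for row in ranked[:3]:
    print(dict(zip(FEATURES, (round(v, 2) for v in row))),
          round(predict_proba(model, row), 3))
```

Each boosting round fits a stump to the residuals between the labels and the current predicted probabilities, so later stumps concentrate on variants that earlier ones scored poorly. A production model would more likely use a dedicated boosting library with deeper trees and regularization.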


Source journal: Human Genetics (Biology: Genetics)

CiteScore: 10.80

Self-citation rate: 3.80%

Publication volume: not listed

Review time: 1 month

About the journal: Human Genetics is a monthly journal publishing original and timely articles on all aspects of human genetics. The Journal particularly welcomes articles in the areas of Behavioral genetics, Bioinformatics, Cancer genetics and genomics, Cytogenetics, Developmental genetics, Disease association studies, Dysmorphology, ELSI (ethical, legal and social issues), Evolutionary genetics, Gene expression, Gene structure and organization, Genetics of complex diseases and epistatic interactions, Genetic epidemiology, Genome biology, Genome structure and organization, Genotype-phenotype relationships, Human Genomics, Immunogenetics and genomics, Linkage analysis and genetic mapping, Methods in Statistical Genetics, Molecular diagnostics, Mutation detection and analysis, Neurogenetics, Physical mapping and Population Genetics. Articles reporting animal models relevant to human biology or disease are also welcome. Preference will be given to those articles which address clinically relevant questions or which provide new insights into human biology. Unless reporting entirely novel and unusual aspects of a topic, clinical case reports, cytogenetic case reports, papers on descriptive population genetics, articles dealing with the frequency of polymorphisms or additional mutations within genes in which numerous lesions have already been described, and papers that report meta-analyses of previously published datasets will normally not be accepted. The Journal typically will not consider for publication manuscripts that report merely the isolation, map position, structure, and tissue expression profile of a gene of unknown function unless the gene is of particular interest or is a candidate gene involved in a human trait or disorder.