Diagnosis of a Single-Nucleotide Variant in Whole-Exome Sequencing Data for Patients With Inherited Diseases: Machine Learning Study Using Artificial Intelligence Variant Prioritization.

JMIR bioinformatics and biotechnology Pub Date : 2022-09-15 DOI:10.2196/37701

Yu-Shan Huang, Ching Hsu, Yu-Chang Chune, I-Cheng Liao, Hsin Wang, Yi-Lin Lin, Wuh-Liang Hwu, Ni-Chung Lee, Feipei Lai

{"title":"Diagnosis of a Single-Nucleotide Variant in Whole-Exome Sequencing Data for Patients With Inherited Diseases: Machine Learning Study Using Artificial Intelligence Variant Prioritization.","authors":"Yu-Shan Huang, Ching Hsu, Yu-Chang Chune, I-Cheng Liao, Hsin Wang, Yi-Lin Lin, Wuh-Liang Hwu, Ni-Chung Lee, Feipei Lai","doi":"10.2196/37701","DOIUrl":null,"url":null,"abstract":"Background: In recent years, thanks to the rapid development of next-generation sequencing (NGS) technology, an entire human genome can be sequenced in a short period. As a result, NGS technology is now being widely introduced into clinical diagnosis practice, especially for diagnosis of hereditary disorders. Although the exome data of single-nucleotide variant (SNV) can be generated using these approaches, processing the DNA sequence data of a patient requires multiple tools and complex bioinformatics pipelines.Objective: This study aims to assist physicians to automatically interpret the genetic variation information generated by NGS in a short period. To determine the true causal variants of a patient with genetic disease, currently, physicians often need to view numerous features on every variant manually and search for literature in different databases to understand the effect of genetic variation.Methods: We constructed a machine learning model for predicting disease-causing variants in exome data. We collected sequencing data from whole-exome sequencing (WES) and gene panel as training set, and then integrated variant annotations from multiple genetic databases for model training. The model built ranked SNVs and output the most possible disease-causing candidates. For model testing, we collected WES data from 108 patients with rare genetic disorders in National Taiwan University Hospital. We applied sequencing data and phenotypic information automatically extracted by a keyword extraction tool from patient's electronic medical records into our machine learning model.Results: We succeeded in locating 92.5% (124/134) of the causative variant in the top 10 ranking list among an average of 741 candidate variants per person after filtering. AI Variant Prioritizer was able to assign the target gene to the top rank for around 61.1% (66/108) of the patients, followed by Variant Prioritizer, which assigned it for 44.4% (48/108) of the patients. The cumulative rank result revealed that our AI Variant Prioritizer has the highest accuracy at ranks 1, 5, 10, and 20. It also shows that AI Variant Prioritizer presents better performance than other tools. After adopting the Human Phenotype Ontology (HPO) terms by looking up the databases, the top 10 ranking list can be increased to 93.5% (101/108).Conclusions: We successfully applied sequencing data from WES and free-text phenotypic information of patient's disease automatically extracted by the keyword extraction tool for model training and testing. By interpreting our model, we identified which features of variants are important. Besides, we achieved a satisfactory result on finding the target variant in our testing data set. After adopting the HPO terms by looking up the databases, the top 10 ranking list can be increased to 93.5% (101/108). The performance of the model is similar to that of manual analysis, and it has been used to help National Taiwan University Hospital with a genetic diagnosis.","PeriodicalId":73552,"journal":{"name":"JMIR bioinformatics and biotechnology","volume":" ","pages":"e37701"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11168239/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"JMIR bioinformatics and biotechnology","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2196/37701","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 0

Abstract

Background: In recent years, thanks to the rapid development of next-generation sequencing (NGS) technology, an entire human genome can be sequenced in a short period. As a result, NGS technology is now being widely introduced into clinical diagnosis practice, especially for diagnosis of hereditary disorders. Although the exome data of single-nucleotide variant (SNV) can be generated using these approaches, processing the DNA sequence data of a patient requires multiple tools and complex bioinformatics pipelines.

Objective: This study aims to assist physicians to automatically interpret the genetic variation information generated by NGS in a short period. To determine the true causal variants of a patient with genetic disease, currently, physicians often need to view numerous features on every variant manually and search for literature in different databases to understand the effect of genetic variation.

Methods: We constructed a machine learning model for predicting disease-causing variants in exome data. We collected sequencing data from whole-exome sequencing (WES) and gene panel as training set, and then integrated variant annotations from multiple genetic databases for model training. The model built ranked SNVs and output the most possible disease-causing candidates. For model testing, we collected WES data from 108 patients with rare genetic disorders in National Taiwan University Hospital. We applied sequencing data and phenotypic information automatically extracted by a keyword extraction tool from patient's electronic medical records into our machine learning model.

Results: We succeeded in locating 92.5% (124/134) of the causative variant in the top 10 ranking list among an average of 741 candidate variants per person after filtering. AI Variant Prioritizer was able to assign the target gene to the top rank for around 61.1% (66/108) of the patients, followed by Variant Prioritizer, which assigned it for 44.4% (48/108) of the patients. The cumulative rank result revealed that our AI Variant Prioritizer has the highest accuracy at ranks 1, 5, 10, and 20. It also shows that AI Variant Prioritizer presents better performance than other tools. After adopting the Human Phenotype Ontology (HPO) terms by looking up the databases, the top 10 ranking list can be increased to 93.5% (101/108).

Conclusions: We successfully applied sequencing data from WES and free-text phenotypic information of patient's disease automatically extracted by the keyword extraction tool for model training and testing. By interpreting our model, we identified which features of variants are important. Besides, we achieved a satisfactory result on finding the target variant in our testing data set. After adopting the HPO terms by looking up the databases, the top 10 ranking list can be increased to 93.5% (101/108). The performance of the model is similar to that of manual analysis, and it has been used to help National Taiwan University Hospital with a genetic diagnosis.

Abstract Image

查看原文本刊更多论文

遗传性疾病患者全外显子组测序数据中单核苷酸变异的诊断:使用AI变异优先排序(预印本)

背景：近年来，由于新一代测序（NGS）技术的快速发展，整个人类基因组可以在短时间内完成测序。因此，NGS 技术正被广泛引入临床诊断实践，尤其是遗传性疾病的诊断。虽然单核苷酸变异（SNV）的外显子组数据可以通过这些方法生成，但处理患者的 DNA 序列数据需要多种工具和复杂的生物信息学管道：本研究旨在帮助医生在短时间内自动解读由 NGS 生成的遗传变异信息。目前，为了确定遗传病患者的真正病因变异，医生往往需要手动查看每个变异的众多特征，并在不同的数据库中搜索文献，以了解遗传变异的影响：我们构建了一个机器学习模型，用于预测外显子组数据中的致病变异。我们收集了来自全外显子组测序（WES）和基因面板的测序数据作为训练集，然后整合了来自多个遗传数据库的变异注释进行模型训练。建立的模型对 SNV 进行排序，并输出最可能的致病候选者。为了测试模型，我们收集了台大医院 108 位罕见遗传疾病患者的 WES 数据。我们将测序数据和通过关键字提取工具从患者电子病历中自动提取的表型信息应用到机器学习模型中：结果：在平均每人 741 个候选变异体中，我们成功找到了 92.5%（124/134）的致病变异体。人工智能变异体排序器能将约61.1%（66/108）的患者的目标基因排在前列，其次是变异体排序器，将44.4%（48/108）的患者的目标基因排在前列。累积排名结果显示，人工智能变体优先器在排名 1、5、10 和 20 时的准确率最高。这也表明，人工智能变体优先器比其他工具具有更好的性能。在通过查询数据库采用人类表型本体（HPO）术语后，前10名的排序率可提高到93.5%（101/108）：我们成功地将 WES 的测序数据和关键词提取工具自动提取的患者疾病自由文本表型信息用于模型训练和测试。通过解释我们的模型，我们确定了哪些变异特征是重要的。此外，我们还在测试数据集中找到了目标变体，并取得了令人满意的结果。在通过查找数据库采用 HPO 术语后，排名前 10 的列表可增加到 93.5%（101/108）。该模型的性能与人工分析相似，并已用于帮助台湾大学医院进行基因诊断。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

JMIR bioinformatics and biotechnology

CiteScore

2.90

自引率

0.00%

发文量