A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning.

Impact factor: 2.4 (JCR Q3, Computer Science, Information Systems)

Frontiers in Big Data. Pub Date: 2025-01-21. eCollection Date: 2024-01-01. DOI: 10.3389/fdata.2024.1466391

Shivika Prasanna, Ajay Kumar, Deepthi Rao, Eduardo J Simoes, Praveen Rao

Frontiers in Big Data, volume 7, article 1466391. Publication type: Journal Article. Full text (PDF): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11790625/pdf/

Citations: 0

Abstract

Advances in high-throughput genome sequencing have enabled large-scale genome sequencing in clinical practice and research studies. By analyzing genomic variants of humans, scientists can gain a better understanding of the risk factors of complex diseases such as cancer and COVID-19. To model and analyze the rich genomic data, knowledge graphs (KGs) and graph machine learning (GML) can be regarded as enabling technologies. In this article, we present a scalable tool called VariantKG for analyzing genomic variants of humans modeled using KGs and GML. Specifically, we used publicly available genome sequencing data from patients with COVID-19. VariantKG extracts variant-level genetic information output by a variant calling pipeline, annotates the variant data with additional metadata, and converts the annotated variant information into a KG represented using the Resource Description Framework (RDF). The resulting KG is further enhanced with patient metadata and stored in a scalable graph database that enables efficient RDF indexing and query processing. VariantKG employs the Deep Graph Library (DGL) to perform GML tasks such as node classification. A user can extract a subset of the KG and perform inference tasks using DGL. The user can monitor the training and testing performance and hardware utilization. We tested VariantKG for KG construction by using 1,508 genome sequences, leading to 4 billion RDF statements. We evaluated GML tasks using VariantKG by selecting a subset of 500 sequences from the KG and performing node classification using well-known GML techniques such as GraphSAGE, Graph Convolutional Network (GCN), and Graph Transformer. VariantKG has intuitive user interfaces and features enabling a low barrier to entry for KG construction, model inference, and model interpretation on genomic variants of humans.
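To make the variant-to-RDF step concrete, the following is a minimal sketch of how a single record from a variant calling pipeline could be turned into RDF statements in N-Triples syntax. It is not VariantKG's implementation: the abstract does not give the tool's ontology, so the `http://example.org/variantkg/` namespace, the predicate names, and the function `vcf_line_to_ntriples` are all hypothetical. It assumes the input is one tab-separated VCF data line whose first five columns are CHROM, POS, ID, REF, and ALT.

```python
from typing import List
from urllib.parse import quote

# Hypothetical namespace; the actual VariantKG ontology is not given in the abstract.
BASE = "http://example.org/variantkg/"
XSD_INTEGER = "http://www.w3.org/2001/XMLSchema#integer"


def vcf_line_to_ntriples(line: str, sample_id: str) -> List[str]:
    """Turn one tab-separated VCF data line into N-Triples statements.

    Only the first five VCF columns (CHROM, POS, ID, REF, ALT) are used.
    """
    chrom, pos, _variant_id, ref, alt = line.rstrip("\n").split("\t")[:5]

    # Build percent-encoded IRIs for the variant node and the sample node.
    variant = f"<{BASE}variant/{quote(f'{chrom}_{pos}_{ref}_{alt}')}>"
    sample = f"<{BASE}sample/{quote(sample_id)}>"

    return [
        f'{variant} <{BASE}chromosome> "{chrom}" .',
        f'{variant} <{BASE}position> "{pos}"^^<{XSD_INTEGER}> .',
        f'{variant} <{BASE}referenceAllele> "{ref}" .',
        f'{variant} <{BASE}alternateAllele> "{alt}" .',
        # Link the sample (patient genome) to the variant it carries.
        f"{sample} <{BASE}hasVariant> {variant} .",
    ]


if __name__ == "__main__":
    record = "chr1\t12345\t.\tA\tG\t50\tPASS\t."
    for triple in vcf_line_to_ntriples(record, "S1"):
        print(triple)
```

Each variant yields a handful of statements in this sketch, which gives a rough sense of how 1,508 genomes can grow into billions of RDF statements once annotations and patient metadata are attached.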

Source journal