A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning.

IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS
Frontiers in Big Data Pub Date : 2025-01-21 eCollection Date: 2024-01-01 DOI:10.3389/fdata.2024.1466391
Shivika Prasanna, Ajay Kumar, Deepthi Rao, Eduardo J Simoes, Praveen Rao
{"title":"A scalable tool for analyzing genomic variants of humans using knowledge graphs and graph machine learning.","authors":"Shivika Prasanna, Ajay Kumar, Deepthi Rao, Eduardo J Simoes, Praveen Rao","doi":"10.3389/fdata.2024.1466391","DOIUrl":null,"url":null,"abstract":"<p><p>Advances in high-throughput genome sequencing have enabled large-scale genome sequencing in clinical practice and research studies. By analyzing genomic variants of humans, scientists can gain better understanding of the risk factors of complex diseases such as cancer and COVID-19. To model and analyze the rich genomic data, knowledge graphs (KGs) and graph machine learning (GML) can be regarded as enabling technologies. In this article, we present a scalable tool called VariantKG for analyzing genomic variants of humans modeled using KGs and GML. Specifically, we used publicly available genome sequencing data from patients with COVID-19. VariantKG extracts variant-level genetic information output by a variant calling pipeline, annotates the variant data with additional metadata, and converts the annotated variant information into a KG represented using the Resource Description Framework (RDF). The resulting KG is further enhanced with patient metadata and stored in a scalable graph database that enables efficient RDF indexing and query processing. VariantKG employs the Deep Graph Library (DGL) to perform GML tasks such as node classification. A user can extract a subset of the KG and perform inference tasks using DGL. The user can monitor the training and testing performance and hardware utilization. We tested VariantKG for KG construction by using 1,508 genome sequences, leading to 4 billion RDF statements. We evaluated GML tasks using VariantKG by selecting a subset of 500 sequences from the KG and performing node classification using well-known GML techniques such as GraphSAGE, Graph Convolutional Network (GCN) and Graph Transformer. VariantKG has intuitive user interfaces and features enabling a low barrier to entry for KG construction, model inference, and model interpretation on genomic variants of humans.</p>","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1466391"},"PeriodicalIF":2.4000,"publicationDate":"2025-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11790625/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Big Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fdata.2024.1466391","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
引用次数: 0

Abstract

Advances in high-throughput genome sequencing have enabled large-scale genome sequencing in clinical practice and research studies. By analyzing genomic variants of humans, scientists can gain better understanding of the risk factors of complex diseases such as cancer and COVID-19. To model and analyze the rich genomic data, knowledge graphs (KGs) and graph machine learning (GML) can be regarded as enabling technologies. In this article, we present a scalable tool called VariantKG for analyzing genomic variants of humans modeled using KGs and GML. Specifically, we used publicly available genome sequencing data from patients with COVID-19. VariantKG extracts variant-level genetic information output by a variant calling pipeline, annotates the variant data with additional metadata, and converts the annotated variant information into a KG represented using the Resource Description Framework (RDF). The resulting KG is further enhanced with patient metadata and stored in a scalable graph database that enables efficient RDF indexing and query processing. VariantKG employs the Deep Graph Library (DGL) to perform GML tasks such as node classification. A user can extract a subset of the KG and perform inference tasks using DGL. The user can monitor the training and testing performance and hardware utilization. We tested VariantKG for KG construction by using 1,508 genome sequences, leading to 4 billion RDF statements. We evaluated GML tasks using VariantKG by selecting a subset of 500 sequences from the KG and performing node classification using well-known GML techniques such as GraphSAGE, Graph Convolutional Network (GCN) and Graph Transformer. VariantKG has intuitive user interfaces and features enabling a low barrier to entry for KG construction, model inference, and model interpretation on genomic variants of humans.

Abstract Image

Abstract Image

Abstract Image

使用知识图和图机器学习分析人类基因组变异的可扩展工具。
高通量基因组测序技术的进步使大规模基因组测序在临床实践和研究中得以应用。通过分析人类的基因组变异,科学家可以更好地了解癌症和COVID-19等复杂疾病的危险因素。为了对丰富的基因组数据进行建模和分析,知识图(KGs)和图机器学习(GML)可以被视为使能技术。在本文中,我们介绍了一个可扩展的工具VariantKG,用于分析使用KGs和GML建模的人类基因组变异。具体来说,我们使用了COVID-19患者的公开基因组测序数据。VariantKG提取由变体调用管道输出的变体级遗传信息,用额外的元数据对变体数据进行注释,并将注释的变体信息转换为使用资源描述框架(RDF)表示的KG。生成的KG使用患者元数据进一步增强,并存储在可扩展的图形数据库中,该数据库支持高效的RDF索引和查询处理。VariantKG使用深度图库(DGL)来执行节点分类等GML任务。用户可以提取KG的一个子集,并使用DGL执行推理任务。用户可以监控训练和测试性能以及硬件利用率。我们使用1508个基因组序列对VariantKG进行了KG构建测试,得到了40亿个RDF语句。我们使用VariantKG评估GML任务,从KG中选择500个序列的子集,并使用GraphSAGE、Graph Convolutional Network (GCN)和Graph Transformer等著名的GML技术进行节点分类。VariantKG具有直观的用户界面和功能,可以为人类基因组变异的KG构建,模型推断和模型解释提供低门槛。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
CiteScore
5.20
自引率
3.20%
发文量
122
审稿时长
13 weeks
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信