MDGraphEmb: a toolkit for graph embedding and classification of protein conformational ensembles
Ferdoos Hossein Nezhad, Namir Oues, Massimiliano Meli, Alessandro Pandini
Bioinformatics (Oxford, England), published 2025-09-01
DOI: https://doi.org/10.1093/bioinformatics/btaf420
Full text (PMC): https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12453676/pdf/
Abstract
Motivation: Molecular Dynamics (MD) simulations are essential for investigating protein dynamics and function. Although significant advances have been made in integrating simulation techniques and machine learning, there are still challenges in selecting the most suitable data representation for learning. Graph embedding is a powerful computational method that automatically learns low-dimensional representations of nodes in a graph while preserving graph topology and node properties, thereby bridging graph structures and machine learning methods. Graph embeddings hold great potential for efficiently representing MD simulation data and studying protein dynamics.
Results: We present MDGraphEmb, a Python library built on MDAnalysis, specifically designed to convert protein MD simulation trajectories into graph-based representations and corresponding graph embeddings. This transformation enables the compression of high-dimensional, noisy trajectories from protein simulations into tabular formats suitable for machine learning. MDGraphEmb provides a framework that supports a range of graph embedding techniques and machine learning models, enabling the creation of workflows to analyse protein dynamics and identify important protein conformations. Graph embedding effectively captures and compresses structural information from protein MD simulation data, making it applicable to diverse downstream machine-learning classification tasks. We present an application for encoding and detecting important protein conformations from molecular dynamics simulations to classify functional states, using adenylate kinase (ADK) as the main case study. To assess the generalizability of the approach, two additional systems, Plantaricin E (PlnE) and HIV-1 protease, are included as supplementary validation examples. A performance comparison of different graph embedding methods combined with machine learning models is also provided.
Availability and implementation: MDGraphEMB GitHub Repository: https://github.com/FerdoosHN/MDGraphEMB.
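The workflow summarised in the Results (trajectory frames, then per-frame graphs, then a tabular embedding, then a classifier) can be illustrated with a minimal, standard-library-only sketch. It does not use or reproduce the MDGraphEmb API. Everything in it is an assumption made for illustration: the toy coordinates stand in for frames that would normally be read with MDAnalysis, the 8 Å contact cutoff is arbitrary, the per-residue contact count stands in for a learned graph embedding, and the nearest-centroid rule stands in for the machine learning models mentioned above. All function names are hypothetical.

```python
import math
import random

CUTOFF = 8.0  # assumed contact cutoff in angstroms (illustrative, not from the paper)


def make_frame(state, rng, n_res=20, spacing=3.8, noise=0.2):
    """Hypothetical toy frame: one pseudo-atom per residue.

    'open'   -> extended chain along x
    'closed' -> hairpin, second half folded back 5 A above the first half
    """
    half = n_res // 2
    coords = []
    for i in range(n_res):
        if state == "closed" and i >= half:
            x, y = spacing * (n_res - 1 - i), 5.0
        else:
            x, y = spacing * i, 0.0
        coords.append((x + rng.gauss(0, noise),
                       y + rng.gauss(0, noise),
                       rng.gauss(0, noise)))
    return coords


def contact_graph(coords, cutoff=CUTOFF):
    """Adjacency sets: residues i and j are linked if within `cutoff`."""
    adj = {i: set() for i in range(len(coords))}
    for i in range(len(coords)):
        for j in range(i + 1, len(coords)):
            if math.dist(coords[i], coords[j]) <= cutoff:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def degree_embedding(adj):
    """Stand-in embedding: one feature per residue (its contact count)."""
    return [len(adj[i]) for i in sorted(adj)]


def centroid(rows):
    """Column-wise mean of a list of equal-length feature rows."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def predict(row, centroids):
    """Nearest-centroid rule: label whose centroid is closest to `row`."""
    return min(centroids, key=lambda label: math.dist(row, centroids[label]))


def run_demo(seed=0, n_train=20, n_test=10):
    """Embed toy frames, fit centroids, return accuracy on held-out frames."""
    rng = random.Random(seed)
    states = ("open", "closed")

    def embed(state):
        return degree_embedding(contact_graph(make_frame(state, rng)))

    centroids = {s: centroid([embed(s) for _ in range(n_train)]) for s in states}
    test = [(embed(s), s) for s in states for _ in range(n_test)]
    correct = sum(predict(row, centroids) == label for row, label in test)
    return correct / len(test)


if __name__ == "__main__":
    print(f"held-out accuracy: {run_demo():.2f}")
```

Each frame becomes one fixed-length row, which is the "tabular format" idea in its simplest form. In a real analysis the rows would come from an actual graph embedding method applied to residue graphs built from simulation frames, and the classifier would be evaluated on properly separated data.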