Physically interpretable performance metrics for clustering.

IF 3.1, Zone 2 (Chemistry), JCR Q3 (CHEMISTRY, PHYSICAL)

Journal of Chemical Physics Pub Date : 2024-12-28 DOI:10.1063/5.0241122

Kinjal Mondal, Jeffery B Klauda

Volume 161, Issue 24. Publication type: Journal Article.

Citations: 0

Abstract

Clustering is a machine learning technique that groups large amounts of data into separate clusters based on similarity. It is now widely used to analyze the large and diverse data sets produced by molecular dynamics (MD) simulations. Typically, the frames of an MD trajectory are clustered into groups, and a representative element from each group is studied separately. An important question in this process is: what is the quality of the resulting clusters? Several performance metrics from the literature, such as the silhouette index and the Davies-Bouldin index, are often used to assess clustering quality. However, most of these metrics focus on the overlap or similarity of the clusters in the reduced dimension used for clustering, and not on the physically important properties or parameters of the system. To address this issue, we have developed two physically interpretable scoring metrics that focus on the physical parameters of the system being analyzed. We have tested our algorithm on four different systems: (1) the Ising model, (2) peptide folding and unfolding of WT HP35, (3) a protein-ligand trajectory of an enzyme and substrate, and (4) a protein-ligand dissociation trajectory. We show that the scoring metrics yield clusters that match our physical intuition about the systems.
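For readers unfamiliar with the two conventional metrics named in the abstract, the following is a minimal sketch of how the silhouette index and the Davies-Bouldin index can be computed from points and cluster labels. It is not the authors' code and does not implement their two new physically interpretable metrics, which the abstract does not define. The function names and the toy 2D data are assumptions made purely for illustration.

```python
import math

def group_by_label(points, labels):
    """Collect points into a dict keyed by cluster label."""
    clusters = {}
    for p, l in zip(points, labels):
        clusters.setdefault(l, []).append(p)
    return clusters

def centroid(points):
    """Arithmetic mean of a list of equal-length coordinate tuples."""
    n = len(points)
    return tuple(sum(c) / n for c in zip(*points))

def silhouette_index(points, labels):
    """Mean silhouette coefficient (range -1 to 1; higher is better).
    Assumes at least two clusters."""
    clusters = group_by_label(points, labels)
    total = 0.0
    for p, l in zip(points, labels):
        own = clusters[l]
        if len(own) == 1:
            continue  # a singleton cluster contributes 0 by convention
        # a: mean distance to the other members of the same cluster
        a = sum(math.dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(math.dist(p, q) for q in other) / len(other)
                for k, other in clusters.items() if k != l)
        if max(a, b) > 0:
            total += (b - a) / max(a, b)
    return total / len(points)

def davies_bouldin_index(points, labels):
    """Davies-Bouldin index (>= 0; lower is better).
    Assumes at least two clusters with distinct centroids."""
    clusters = group_by_label(points, labels)
    cents = {k: centroid(v) for k, v in clusters.items()}
    # scatter: mean distance of each cluster's members to its centroid
    scatter = {k: sum(math.dist(p, cents[k]) for p in v) / len(v)
               for k, v in clusters.items()}
    total = 0.0
    for i in clusters:
        total += max((scatter[i] + scatter[j]) / math.dist(cents[i], cents[j])
                     for j in clusters if j != i)
    return total / len(clusters)

# Toy data (made up): two well-separated squares in a 2D "reduced" space.
points = [(0, 0), (0, 1), (1, 0), (1, 1),
          (10, 10), (10, 11), (11, 10), (11, 11)]
good_labels = [0, 0, 0, 0, 1, 1, 1, 1]  # matches the two squares
bad_labels = [0, 1, 0, 1, 0, 1, 0, 1]   # mixes the two squares

sil_good = silhouette_index(points, good_labels)
sil_bad = silhouette_index(points, bad_labels)
db_good = davies_bouldin_index(points, good_labels)
db_bad = davies_bouldin_index(points, bad_labels)

print("silhouette     good:", sil_good, "bad:", sil_bad)
print("Davies-Bouldin good:", db_good, "bad:", db_bad)
```

Both scores depend only on distances in the space used for clustering, which is the limitation the abstract points out: they carry no information about physical observables of the underlying system.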


Source journal

Journal of Chemical Physics (Physics - Physics: Atomic, Molecular and Chemical Physics)

CiteScore: 7.40

Self-citation rate: 15.90%

Articles published: 1615

Review time: 2 months

About the journal: The Journal of Chemical Physics publishes quantitative and rigorous science of long-lasting value in methods and applications of chemical physics. The Journal also publishes brief Communications of significant new findings, Perspectives on the latest advances in the field, and Special Topic issues. The Journal focuses on innovative research in experimental and theoretical areas of chemical physics, including spectroscopy, dynamics, kinetics, statistical mechanics, and quantum mechanics. In addition, topical areas such as polymers, soft matter, materials, surfaces/interfaces, and systems of biological relevance are of increasing importance. Topical coverage includes: Theoretical Methods and Algorithms; Advanced Experimental Techniques; Atoms, Molecules, and Clusters; Liquids, Glasses, and Crystals; Surfaces, Interfaces, and Materials; Polymers and Soft Matter; Biological Molecules and Networks.