PySIDT: Subgraph Isomorphic Decision Trees for Molecular Property Prediction

Impact factor: 2.8; Zone 2 (Chemistry); JCR Q3 (Chemistry, Physical)
Matthew S Johnson, Hao-Wei Pang, Anna C Doner, William H Green, Judit Zádor
Journal: The Journal of Physical Chemistry A. Publication date: 2025-10-22. Publication type: Journal Article. DOI: 10.1021/acs.jpca.5c04483 (https://doi.org/10.1021/acs.jpca.5c04483)
Citations: 0

Abstract

PySIDT: Subgraph Isomorphic Decision Trees for Molecular Property Prediction.

Accurate molecular property prediction is important across all fields of chemistry. Deep neural networks (DNNs) have become increasingly popular because they can be trained automatically, avoiding the tedious process of constructing and extending traditional property estimation schemes. However, DNNs require large amounts of training data, are challenging to interpret, require large amounts of memory to load even during inference, and have severe difficulty incorporating qualitative chemical knowledge, which is often desired for molecular property prediction tasks. Here we present PySIDT (https://github.com/zadorlab/PySIDT), a software package for training and running inference on Subgraph Isomorphic Decision Trees (SIDTs). SIDTs are graph-based decision trees made of nodes associated with molecular substructures. Inference is done by descending target molecular structures down the decision tree to nodes with matching subgraph isomorphic substructures and making predictions based on the final (most specific) nodes matched. SIDTs scale down well to dataset sizes much smaller than is feasible for DNNs. As trees of molecular substructures, SIDTs are inherently readable and easy to visualize, making them easy to analyze. They are also straightforward to extend and retrain, facilitate uncertainty estimation, and enable easy integration of expert knowledge. We demonstrate the SIDT approach by discussing its application to a diverse range of molecular prediction tasks: rate coefficient estimation, diffusion coefficient estimation, thermochemistry estimation, transition state bond stretch prediction, pKa prediction, stability of molecular structures, stability of surface structures, and prediction of surface lateral interaction energetics. Additionally, we demonstrate the power of the SIDT algorithms in two direct, vanilla learning-curve comparisons with the popular DNN-based software Chemprop and the popular gradient boosted trees-based software XGBoost on enthalpy of formation and rate coefficient prediction tasks. In particular, in the enthalpy of formation case, vanilla PySIDT outperforms vanilla Chemprop and XGBoost across the full range of training/validation set sizes, out to 11,560 data points.
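
The descent step described in the abstract can be illustrated with a toy sketch. This is not PySIDT's API or its actual algorithm: the `Graph`, `Node`, `contains`, and `predict` names are hypothetical, the substructure test is a brute-force search that ignores bond orders, the node values are arbitrary numbers, and the descent follows only the first matching child at each level, whereas the abstract refers to predictions based on the final nodes matched.

```python
from dataclasses import dataclass, field
from itertools import permutations


@dataclass
class Graph:
    """Toy molecular graph: element labels plus undirected bonds (no bond orders)."""
    atoms: list  # element symbols; "*" in a pattern matches any element
    bonds: set   # set of frozenset({i, j}) atom-index pairs


def make_graph(atoms, bonds=()):
    return Graph(list(atoms), {frozenset(b) for b in bonds})


def contains(mol, pat):
    """Brute-force check that `pat` maps injectively into `mol`,
    preserving element labels and every pattern bond."""
    n = len(pat.atoms)
    for cand in permutations(range(len(mol.atoms)), n):
        labels_ok = all(pat.atoms[i] in ("*", mol.atoms[cand[i]]) for i in range(n))
        bonds_ok = all(frozenset(cand[i] for i in b) in mol.bonds for b in pat.bonds)
        if labels_ok and bonds_ok:
            return True
    return False


@dataclass
class Node:
    """Hypothetical tree node: a substructure pattern and a stored prediction."""
    name: str
    pattern: Graph
    value: float  # arbitrary illustrative value, not real data
    children: list = field(default_factory=list)


def predict(root, mol):
    """Descend from the root while some child pattern matches `mol`;
    return the value of the last (most specific) node reached and the path."""
    node, path = root, [root.name]
    while True:
        nxt = next((c for c in node.children if contains(mol, c.pattern)), None)
        if nxt is None:
            return node.value, path
        node = nxt
        path.append(node.name)


# The root has an empty pattern, so it matches every molecule.
tree = Node("root", make_graph([]), 0.0, [
    Node("C-O", make_graph(["C", "O"], [(0, 1)]), 1.0, [
        Node("C-O-H", make_graph(["C", "O", "H"], [(0, 1), (1, 2)]), 2.0),
    ]),
    Node("C-N", make_graph(["C", "N"], [(0, 1)]), 3.0),
])

# Simplified molecules; most hydrogens are left out for brevity.
methanol = make_graph(["C", "O", "H"], [(0, 1), (1, 2)])
dimethyl_ether = make_graph(["C", "O", "C"], [(0, 1), (1, 2)])
ethane = make_graph(["C", "C"], [(0, 1)])

for name, mol in [("methanol", methanol), ("dimethyl ether", dimethyl_ether), ("ethane", ethane)]:
    print(name, predict(tree, mol))
```

In this sketch a molecule that matches no child simply receives the value stored at the deepest node it did reach, which is one way to read the abstract's point that SIDTs still give predictions when data for specific substructures is scarce.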

Source journal: The Journal of Physical Chemistry A (Chemistry; Physics: Atomic, Molecular and Chemical Physics)
CiteScore: 5.20; self-citation rate: 10.30%; articles published: 922; review time: 1.3 months
Journal description: The Journal of Physical Chemistry A is devoted to reporting new and original experimental and theoretical basic research of interest to physical chemists, biophysical chemists, and chemical physicists.