Matthew S Johnson, Hao-Wei Pang, Anna C Doner, William H Green, Judit Zádor
The Journal of Physical Chemistry A. Published 2025-10-22. DOI: 10.1021/acs.jpca.5c04483 (https://doi.org/10.1021/acs.jpca.5c04483)
PySIDT: Subgraph Isomorphic Decision Trees for Molecular Property Prediction.
Accurate molecular property prediction is important across all fields of chemistry. Deep neural networks (DNNs) have become increasingly popular due to their ability to train automatically, avoiding the incredibly tedious process of constructing and extending traditional property estimation schemes. However, DNNs require large amounts of training data, are challenging to interpret, require large amounts of memory to load even during inference, and have severe difficulties incorporating qualitative chemical knowledge, which is often desired for molecular property prediction tasks. Here we present PySIDT (https://github.com/zadorlab/PySIDT), software for training and running inference on Subgraph Isomorphic Decision Trees (SIDTs). SIDTs are graph-based decision trees made of nodes associated with molecular substructures. Inference is done by descending target molecular structures down the decision tree to nodes with matching subgraph isomorphic substructures and making predictions based on the final (most specific) nodes matched. SIDTs scale down well to dataset sizes much smaller than is feasible for DNNs. As trees of molecular substructures, SIDTs are inherently readable and easy to visualize, making them easy to analyze. They are also straightforward to extend and retrain, facilitate uncertainty estimation, and enable easy integration of expert knowledge. We demonstrate the SIDT approach by discussing its application to a diverse range of molecular prediction tasks: rate coefficient estimation, diffusion coefficient estimation, thermochemistry estimation, transition state bond stretch prediction, pKa prediction, stability of molecular structures, stability of surface structures, and prediction of surface lateral interaction energetics. Additionally, we demonstrate the power of the SIDT algorithms in two direct vanilla learning-curve comparisons with the popular DNN-based software Chemprop and the popular gradient-boosted-tree-based software XGBoost on enthalpy of formation and rate coefficient prediction tasks. In particular, in the enthalpy of formation case, vanilla PySIDT outperforms vanilla Chemprop and XGBoost across the full range of training/validation set sizes, up to 11,560 data points.
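The inference step described in the abstract (descend a molecule through the tree to the most specific node whose substructure it contains) can be illustrated with a small standalone sketch. This is not the PySIDT API: the `Graph`, `Node`, `is_subgraph_isomorphic`, and `descend` names, the heavy-atom-only graphs without bond orders, the brute-force matcher, the first-matching-child rule, and the node values are all assumptions made for illustration only.

```python
from dataclasses import dataclass, field
from itertools import permutations


@dataclass
class Graph:
    atoms: list  # element symbols; the list index is the atom id
    bonds: set   # set of frozenset({i, j}) pairs, bond orders ignored


def make_graph(atoms, bonds):
    return Graph(list(atoms), {frozenset(b) for b in bonds})


def is_subgraph_isomorphic(mol, pattern):
    """Brute-force check that `pattern` embeds in `mol`.

    Element labels must agree and every pattern bond must exist in the
    molecule (extra bonds in the molecule are allowed). Only practical
    for tiny graphs; real tools use algorithms such as VF2.
    """
    n = len(pattern.atoms)
    for mapping in permutations(range(len(mol.atoms)), n):
        if any(mol.atoms[mapping[i]] != pattern.atoms[i] for i in range(n)):
            continue
        if all(frozenset(mapping[i] for i in bond) in mol.bonds
               for bond in pattern.bonds):
            return True
    return False


@dataclass
class Node:
    name: str
    pattern: Graph
    value: float  # placeholder prediction stored at this node
    children: list = field(default_factory=list)


def descend(root, mol):
    """Follow the first matching child until no child matches."""
    node = root
    while True:
        for child in node.children:
            if is_subgraph_isomorphic(mol, child.pattern):
                node = child
                break
        else:
            return node  # most specific node reached


# Hypothetical toy tree; the values are arbitrary placeholders, not data.
c_c_o = Node("C-C-O", make_graph("CCO", [(0, 1), (1, 2)]), 2.0)
c_o = Node("C-O", make_graph("CO", [(0, 1)]), 1.0, [c_c_o])
c_n = Node("C-N", make_graph("CN", [(0, 1)]), 3.0)
root = Node("root", make_graph("", []), 0.0, [c_o, c_n])

ethanol = make_graph("CCO", [(0, 1), (1, 2)])
methanol = make_graph("CO", [(0, 1)])
ethane = make_graph("CC", [(0, 1)])

print(descend(root, ethanol).name)   # C-C-O
print(descend(root, methanol).name)  # C-O
print(descend(root, ethane).name)    # root
```

Because the root carries an empty pattern, every molecule matches it, so a molecule that fits none of the more specific substructures (ethane here) still receives the root's general estimate. The abstract speaks of predictions based on the final nodes matched, so the actual software may combine more than the single node this sketch returns.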
About the journal:
The Journal of Physical Chemistry A is devoted to reporting new and original experimental and theoretical basic research of interest to physical chemists, biophysical chemists, and chemical physicists.