Muhammad Zia Afzal, Shahid Saeed Siddiqi, Abdul Rauf Nizami
Predicting molecular properties using adjacency matrix powers and atomic number sequences
Chemical Engineering Science, volume 320, Article 122650 (journal article, published 2025-09-21)
DOI: 10.1016/j.ces.2025.122650
https://www.sciencedirect.com/science/article/pii/S000925092501471X
Citations: 0
Abstract
Accurate prediction of physicochemical properties such as boiling points is central to virtual screening, process design, and regulatory assessment in cheminformatics. Here, we introduce a permutation-invariant molecular descriptor derived from two components: powers of a weighted adjacency matrix, in which edge weights encode bond types (single = 1, double = 2, triple = 3, aromatic = 1.5), and a sorted atomic number sequence padded to a fixed length. The descriptor captures walk-based topological information while maintaining a compact, fixed-size representation for molecules of up to 134 atoms. We benchmark its performance against five established representations (MACCS keys, Morgan fingerprints, Mordred descriptors, Coulomb matrix, Weisfeiler-Lehman graph kernels) using Support Vector Regression, Random Forest, Extreme Gradient Boosting (XGBoost), and a Directed Message-Passing Neural Network (D-MPNN via Chemprop).
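As a rough illustration of the idea described above, the following Python sketch builds a bond-weighted adjacency matrix, takes its powers, and appends a sorted, zero-padded atomic number sequence. The abstract does not say which quantities are extracted from each matrix power, so the choice here (trace and total entry sum of each power, both of which are unchanged under atom reordering) is an assumption made for illustration, as are the function names, the descending sort order, and the default `max_power=3`. It should not be read as the authors' implementation.

```python
from typing import List, Sequence, Tuple

# Bond-type weights as stated in the abstract.
BOND_WEIGHTS = {"single": 1.0, "double": 2.0, "triple": 3.0, "aromatic": 1.5}

Bond = Tuple[int, int, str]  # (atom index i, atom index j, bond type)


def weighted_adjacency(n_atoms: int, bonds: Sequence[Bond]) -> List[List[float]]:
    """Symmetric adjacency matrix whose entries are bond-type weights."""
    a = [[0.0] * n_atoms for _ in range(n_atoms)]
    for i, j, kind in bonds:
        a[i][j] = a[j][i] = BOND_WEIGHTS[kind]
    return a


def matmul(x: List[List[float]], y: List[List[float]]) -> List[List[float]]:
    """Plain square-matrix product, adequate for small molecules."""
    n = len(x)
    return [[sum(x[i][k] * y[k][j] for k in range(n)) for j in range(n)]
            for i in range(n)]


def descriptor(atomic_numbers: Sequence[int], bonds: Sequence[Bond],
               max_power: int = 3, max_atoms: int = 134) -> List[float]:
    """Hypothetical walk-based descriptor of length 2 * max_power + max_atoms."""
    n = len(atomic_numbers)
    if n > max_atoms:
        raise ValueError("molecule exceeds the fixed descriptor size")
    a = weighted_adjacency(n, bonds)
    features: List[float] = []
    power = a
    for _ in range(max_power):
        # Assumed summaries: weighted closed walks (trace) and all weighted walks (sum).
        features.append(sum(power[i][i] for i in range(n)))
        features.append(sum(sum(row) for row in power))
        power = matmul(power, a)
    # Sorting removes the dependence on atom ordering; zeros pad to a fixed length.
    padded = sorted(atomic_numbers, reverse=True) + [0] * (max_atoms - n)
    features.extend(padded)
    return features
```

For example, the heavy-atom skeleton of ethanol can be written as `descriptor([6, 6, 8], [(0, 1, "single"), (1, 2, "single")])`; listing the atoms in the opposite order (`[8, 6, 6]` with the same bond list) gives the same vector, which is the permutation invariance the abstract refers to.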
On a diverse dataset of 5432 small organic compounds with experimentally measured boiling points, our descriptor paired with XGBoost achieved a 5-fold cross-validation mean absolute error (MAE) of 18.52 ± 0.34 °C, a root mean square error (RMSE) of 27.16 ± 0.32 °C, and $R^2$ = 0.898 ± 0.002. While high-dimensional Mordred descriptors yielded the most accurate predictions (MAE = 10.54 ± 0.23 °C, $R^2$ = 0.960 ± 0.003), our descriptor presents a balanced alternative to simple fingerprints and more complex graph-based representations.
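The three reported metrics follow their standard definitions, which the minimal Python sketch below spells out together with a per-fold summary. The function names are hypothetical, and reporting the spread as the sample standard deviation across folds is an assumption, since the abstract does not say how the ± values were computed. No numbers from the study are reproduced here.

```python
import math
import statistics
from typing import Sequence, Tuple


def mae(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def rmse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def r2(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def summarize(fold_scores: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of one metric across CV folds."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)
```

In a 5-fold setting, each metric would be evaluated once per held-out fold and the five values passed to `summarize` to obtain a "mean ± spread" figure of the kind quoted above.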
This study demonstrates that a computationally efficient, walk-based descriptor can achieve robust and interpretable performance in boiling-point regression tasks. Its low dimensionality and generalizability across models make it a promising tool for broader thermophysical property prediction and integration into graph-neural-network architectures.
About the journal:
Chemical engineering enables the transformation of natural resources and energy into useful products for society. It draws on and applies natural sciences, mathematics and economics, and has developed fundamental engineering science that underpins the discipline.
Chemical Engineering Science (CES) has been publishing papers on the fundamentals of chemical engineering since 1951. Since then, CES has been the platform where the most significant advances in the discipline have been published. Chemical Engineering Science has accompanied and sustained chemical engineering through its development into the vibrant and broad scientific discipline it is today.