Machine learning for asphaltene polarizability: Evaluating molecular descriptors
Arun K. Sharma, Owen McMillan, Selsela Arsala, Supreet Gandhok, Rylend Young
Digital Chemical Engineering, Volume 15, Article 100244. Published 2025-06-01. DOI: 10.1016/j.dche.2025.100244
https://www.sciencedirect.com/science/article/pii/S2772508125000286
Citation count: 0
Abstract
Asphaltenes are complex polycyclic organic molecules in crude oil that readily aggregate and precipitate under varying thermodynamic conditions. Their structural heterogeneity influences key physicochemical properties, including solubility, stability, and reactivity. Molecular polarizability, a crucial property governing intermolecular interactions and electronic behavior, remains challenging to predict due to this structural diversity. This study employs machine learning models to predict isotropic polarizability using two sets of molecular descriptors: WHIM and GETAWAY. A dataset of 255 asphaltene structures was analyzed using stratified sampling, generating 10 independent training (80 %) and testing (20 %) splits. The Wolfram Language’s Predict function evaluated multiple machine learning algorithms—including Random Forest, Decision Tree, Gradient Boosted Trees, Nearest Neighbors, Linear Regression, Gaussian Process, and Neural Network—through an automated model selection process, serving as an AutoML framework. Linear regression was the best-performing model in 9 out of 10 splits for GETAWAY descriptors. GETAWAY-based models achieved an average mean absolute deviation of 0.0920 ± 0.0030 and standard deviation of 0.113 ± 0.004, significantly outperforming WHIM-based models (MAD = 0.173 ± 0.007, STD = 0.224 ± 0.008) with paired t-tests confirming statistical significance (p < 0.001). While R² values were reported, their interpretability was limited by heterogeneity and narrow property ranges in some test sets. These findings demonstrate the effectiveness of AutoML-guided approaches for predicting molecular properties and identify GETAWAY descriptors as a robust, efficient basis for polarizability prediction. Accurate prediction of polarizability is essential for modeling intermolecular forces and improving force field design in petroleum and materials chemistry, issues that are central to industrial and chemical applications.
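The evaluation protocol described in the abstract (repeated stratified 80/20 splits, per-split error metrics, and a paired comparison of the two descriptor sets) can be sketched roughly as follows. This is an illustrative Python sketch using only the standard library, not the authors' Wolfram Language workflow. All function names are hypothetical. Stratifying by binning the sorted target values is an assumption, as is taking the reported standard deviation to be the root mean square of the residuals; the abstract defines neither. The target values below are synthetic placeholders, not data from the study.

```python
import math
import random
import statistics


def stratified_split(targets, n_bins=5, test_fraction=0.2, seed=0):
    """Return (train, test) index lists, drawing test items from every target bin."""
    rng = random.Random(seed)
    # Sort indices by target value, then cut the sorted order into equal-count bins.
    # ASSUMPTION: the abstract does not say how the stratification was done.
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    bin_size = math.ceil(len(order) / n_bins)
    train, test = [], []
    for start in range(0, len(order), bin_size):
        members = order[start:start + bin_size]
        rng.shuffle(members)
        n_test = round(len(members) * test_fraction)
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def mean_absolute_deviation(actual, predicted):
    """Mean of the absolute residuals |actual - predicted|."""
    return statistics.fmean([abs(a - p) for a, p in zip(actual, predicted)])


def residual_rms(actual, predicted):
    """Root mean square of the residuals (ASSUMED meaning of the reported STD)."""
    return math.sqrt(statistics.fmean([(a - p) ** 2 for a, p in zip(actual, predicted)]))


def paired_t_statistic(errors_a, errors_b):
    """t statistic for paired per-split errors; compare against t with n - 1 dof."""
    diffs = [a - b for a, b in zip(errors_a, errors_b)]
    standard_error = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.fmean(diffs) / standard_error


# Synthetic placeholder targets (NOT data from the study): 255 values, 10 seeded splits.
placeholder_rng = random.Random(42)
targets = [placeholder_rng.random() for _ in range(255)]
splits = [stratified_split(targets, seed=s) for s in range(10)]
```

In use, a model would be fit on each training split for each descriptor set, the two metrics computed on the matching test split, and the ten per-split errors for WHIM and GETAWAY passed to `paired_t_statistic`. The resulting statistic would then be compared against a t distribution with 9 degrees of freedom to obtain a p-value, a step that needs a statistics library and is omitted here.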