Machine learning for asphaltene polarizability: Evaluating molecular descriptors

Impact factor: 3.0 · JCR: Q2 (Engineering, Chemical)
Arun K. Sharma, Owen McMillan, Selsela Arsala, Supreet Gandhok, Rylend Young
Digital Chemical Engineering, vol. 15, Article 100244. Published 2025-06-01. DOI: 10.1016/j.dche.2025.100244
https://www.sciencedirect.com/science/article/pii/S2772508125000286

Abstract

Asphaltenes are complex polycyclic organic molecules in crude oil that readily aggregate and precipitate under varying thermodynamic conditions. Their structural heterogeneity influences key physicochemical properties, including solubility, stability, and reactivity. Molecular polarizability, a crucial property governing intermolecular interactions and electronic behavior, remains challenging to predict due to this structural diversity. This study employs machine learning models to predict isotropic polarizability using two sets of molecular descriptors: WHIM and GETAWAY. A dataset of 255 asphaltene structures was analyzed using stratified sampling, generating 10 independent training (80 %) and testing (20 %) splits. The Wolfram Language’s Predict function evaluated multiple machine learning algorithms—including Random Forest, Decision Tree, Gradient Boosted Trees, Nearest Neighbors, Linear Regression, Gaussian Process, and Neural Network—through an automated model selection process, serving as an AutoML framework. Linear regression was the best-performing model in 9 out of 10 splits for GETAWAY descriptors. GETAWAY-based models achieved an average mean absolute deviation of 0.0920 ± 0.0030 and standard deviation of 0.113 ± 0.004, significantly outperforming WHIM-based models (MAD = 0.173 ± 0.007, STD = 0.224 ± 0.008) with paired t-tests confirming statistical significance (p < 0.001). While R² values were reported, their interpretability was limited by heterogeneity and narrow property ranges in some test sets. These findings demonstrate the effectiveness of AutoML-guided approaches for predicting molecular properties and identify GETAWAY descriptors as a robust, efficient basis for polarizability prediction. Accurate prediction of polarizability is essential for modeling intermolecular forces and improving force field design in petroleum and materials chemistry, issues that are central to industrial and chemical applications.
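The evaluation protocol described in the abstract (repeated stratified 80/20 splits, per-split error metrics, and a paired comparison of two descriptor sets) can be sketched in a few lines. The study itself used the Wolfram Language's Predict function; the Python below is not the authors' code and only illustrates the shape of the procedure. Everything in it is an assumption made for illustration: the data are synthetic, each "descriptor set" is reduced to a single noisy feature, the model is a one-variable least-squares line, and stratification is done on equal-count bins of the target, which the abstract does not specify.

```python
import math
import random
import statistics


def stratified_split(targets, n_bins=5, test_frac=0.2, seed=0):
    """Split indices into train/test, drawing test_frac from each target bin."""
    rng = random.Random(seed)
    order = sorted(range(len(targets)), key=lambda i: targets[i])
    n = len(order)
    train, test = [], []
    for k in range(n_bins):
        # Equal-count bins over the sorted targets (assumed stratification rule).
        members = order[k * n // n_bins:(k + 1) * n // n_bins]
        rng.shuffle(members)
        n_test = round(test_frac * len(members))
        test.extend(members[:n_test])
        train.extend(members[n_test:])
    return train, test


def fit_line(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    mx, my = statistics.fmean(x), statistics.fmean(y)
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx


def evaluate(x, y, train, test):
    """Fit on the training indices; return (MAD, STD) of test residuals."""
    slope, intercept = fit_line([x[i] for i in train], [y[i] for i in train])
    residuals = [y[i] - (slope * x[i] + intercept) for i in test]
    mad = statistics.fmean([abs(r) for r in residuals])
    std = statistics.stdev(residuals)
    return mad, std


def paired_t_statistic(a, b):
    """Paired t statistic and degrees of freedom for matched samples a, b."""
    diffs = [ai - bi for ai, bi in zip(a, b)]
    n = len(diffs)
    t = statistics.fmean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))
    return t, n - 1


def compare_descriptor_sets(n_molecules=255, n_splits=10):
    """Run the repeated-split comparison on synthetic stand-in data."""
    rng = random.Random(42)
    y = [rng.gauss(0.0, 1.0) for _ in range(n_molecules)]   # fake target
    x_a = [v + rng.gauss(0.0, 0.10) for v in y]              # informative feature
    x_b = [v + rng.gauss(0.0, 0.25) for v in y]              # noisier feature
    mads_a, mads_b = [], []
    for seed in range(n_splits):
        # Both feature sets are scored on the same split, so the MADs are paired.
        train, test = stratified_split(y, seed=seed)
        mads_a.append(evaluate(x_a, y, train, test)[0])
        mads_b.append(evaluate(x_b, y, train, test)[0])
    t, dof = paired_t_statistic(mads_a, mads_b)
    return mads_a, mads_b, t, dof


if __name__ == "__main__":
    mads_a, mads_b, t, dof = compare_descriptor_sets()
    print("mean MAD (feature A):", statistics.fmean(mads_a))
    print("mean MAD (feature B):", statistics.fmean(mads_b))
    print("paired t statistic:", t, "with", dof, "degrees of freedom")
```

Scoring both feature sets on identical splits is what makes the test paired: each split contributes one MAD difference, and the t statistic is computed on those ten differences. A p-value would then follow from the Student t distribution with the returned degrees of freedom, which a statistics library can supply.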