Pure component property estimation framework using explainable machine learning methods

IF 3.7, Zone 3 (Engineering Technology), JCR Q2 ENGINEERING, CHEMICAL

Chinese Journal of Chemical Engineering Pub Date : 2025-08-01 DOI:10.1016/j.cjche.2025.05.011

Jianfeng Jiao , Xi Gao , Jie Li

Volume 84, Pages 158-178. Full text: https://www.sciencedirect.com/science/article/pii/S1004954125002095

Citations: 0

Abstract

Accurate prediction of pure component physicochemical properties is crucial for process integration, multiscale modelling, and optimization. In this work, an enhanced framework for pure component property prediction using explainable machine learning methods is proposed. In this framework, a molecular representation method based on the connectivity matrix accounts for atomic bonding relationships and generates features automatically. Random forest, a supervised machine learning model, is applied for feature ranking and pooling. The adjusted R² is introduced to penalize the inclusion of additional features, providing an assessment of the true contribution of features. The prediction results for normal boiling point (T_b), liquid molar volume (L_mv), critical temperature (T_c) and critical pressure (P_c) obtained using Artificial Neural Network and Gaussian Process Regression models confirm the accuracy of the molecular representation method. Comparison with group contribution (GC) based models shows that the root-mean-square error on the test set can be reduced by up to 83.8%. To enhance the interpretability of the model, a feature analysis method based on Shapley values is employed to determine the contribution of each feature to the property predictions. The results indicate that the feature pooling method reduces the number of features from 13316 to 100 without compromising model accuracy. The feature analysis results for T_b, L_mv, T_c, and P_c confirm that different molecular properties are influenced by different structural features, in line with mechanistic interpretations. In conclusion, the proposed framework is demonstrated to be feasible and provides a solid foundation for mixture component reconstruction and process integration modelling.
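The abstract does not spell out the adjusted R² formula, so the following is only a minimal sketch assuming the standard textbook definition, 1 - (1 - R²)(n - 1)/(n - p - 1), with n samples and p features. The function names and the sample counts in the demo are illustrative assumptions, not values from the paper.

```python
def r_squared(y_true, y_pred):
    """Ordinary coefficient of determination, R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_y) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Adjusted R^2 (standard definition, assumed here).

    The factor (n - 1) / (n - p - 1) grows with the feature count p,
    so extra features lower the score unless they raise R^2 enough.
    """
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("need n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


if __name__ == "__main__":
    # Hypothetical numbers: same raw R^2, different feature counts.
    for p in (10, 100, 900):
        print(p, round(adjusted_r_squared(0.95, n_samples=1000, n_features=p), 4))
```

With a fixed raw R², the adjusted value falls as the feature count rises, which is the penalty the abstract refers to when it describes assessing the true contribution of features.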
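To illustrate what a Shapley-value feature contribution means, here is a brute-force sketch that averages each feature's marginal contribution over all feature orderings. It is not the authors' implementation: the abstract does not say which Shapley estimator is used, and exact enumeration like this is only practical for a handful of features. The toy model, the input, and the all-zero baseline are hypothetical.

```python
from itertools import permutations
from math import factorial


def shapley_values(model, x, baseline):
    """Exact Shapley values by enumerating all feature orderings.

    Features not yet "switched on" keep their baseline value. Each
    feature's value is its average marginal effect on the model output.
    """
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        current = list(baseline)
        prev = model(current)
        for i in order:
            current[i] = x[i]          # switch feature i on
            new = model(current)
            phi[i] += new - prev       # marginal contribution of i
            prev = new
    n_orderings = factorial(n)
    return [p / n_orderings for p in phi]


def toy_model(features):
    """Hypothetical property model with one interaction term (a * c)."""
    a, b, c = features
    return 2.0 * a + 3.0 * b + a * c


if __name__ == "__main__":
    x = [1.0, 1.0, 1.0]
    baseline = [0.0, 0.0, 0.0]
    phi = shapley_values(toy_model, x, baseline)
    print(phi)
    # Efficiency property: contributions sum to f(x) - f(baseline).
    print(sum(phi), toy_model(x) - toy_model(baseline))
```

The interaction term is split evenly between the two features involved, and the contributions sum to the difference between the prediction and the baseline prediction. That additivity is what makes per-feature attributions comparable across properties.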


Source journal

Chinese Journal of Chemical Engineering (Engineering Technology, Engineering: Chemical)

CiteScore

6.60

Self-citation rate

5.30%

Articles published

4309

Review time

31 days

Journal description: The Chinese Journal of Chemical Engineering (monthly, started in 1982) is the official journal of the Chemical Industry and Engineering Society of China and is published by the Chemical Industry Press Co. Ltd. The aim of the journal is to develop the international exchange of scientific and technical information in the field of chemical engineering. It publishes original research papers that cover the major advancements and achievements in chemical engineering in China, as well as some articles from overseas contributors. The journal's topics include chemical engineering, chemical technology, biochemical engineering, energy and environmental engineering, and other relevant fields. Papers are published on the basis of their relevance to theoretical research, practical application, or potential industrial use, as Research Papers, Communications, Reviews, and Perspectives. Prominent domestic and overseas chemical experts and scholars have been invited to form an International Advisory Board and the Editorial Committee. The journal is recognized in Chinese academia and industry as a reliable source of information on current chemical engineering research, both at home and abroad.