{"title":"Pure component property estimation framework using explainable machine learning methods","authors":"Jianfeng Jiao , Xi Gao , Jie Li","doi":"10.1016/j.cjche.2025.05.011","DOIUrl":null,"url":null,"abstract":"<div><div>Accurate prediction of pure component physiochemical properties is crucial for process integration, multiscale modelling, and optimization. In this work, an enhanced framework for pure component property prediction by using explainable machine learning methods is proposed. In this framework, the molecular representation method based on the connectivity matrix effectively considers atomic bonding relationships to automatically generate features. The supervised machine learning model random forest is applied for feature ranking and pooling. The adjusted <em>R</em><sup>2</sup> is introduced to penalize the inclusion of additional features, providing an assessment of the true contribution of features. The prediction results for normal boiling point (<em>T</em><sub><em>b</em></sub>), liquid molar volume (<em>L</em><sub><em>mv</em></sub>), critical temperature (<em>T</em><sub><em>c</em></sub>) and critical pressure (<em>P</em><sub><em>c</em></sub>) obtained using Artificial Neural Network and Gaussian Process Regression models confirm the accuracy of the molecular representation method. Comparison with GC based models shows that the root-mean-square error on the test set can be reduced by up to 83.8%. To enhance the interpretability of the model, a feature analysis method based on Shapley values is employed to determine the contribution of each feature to the property predictions. The results indicate that using the feature pooling method reduces the number of features from 13316 to 100 without compromising model accuracy. 
The feature analysis results for <em>T</em><sub><em>b</em></sub>, <em>L</em><sub><em>mv</em></sub>, <em>T</em><sub><em>c</em></sub>, and <em>P</em><sub><em>c</em></sub> confirms that different molecular properties are influenced by different structural features, aligning with mechanistic interpretations. In conclusion, the proposed framework is demonstrated to be feasible and provides a solid foundation for mixture component reconstruction and process integration modelling.</div></div>","PeriodicalId":9966,"journal":{"name":"Chinese Journal of Chemical Engineering","volume":"84 ","pages":"Pages 158-178"},"PeriodicalIF":3.7000,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Chinese Journal of Chemical Engineering","FirstCategoryId":"5","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1004954125002095","RegionNum":3,"RegionCategory":"工程技术","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ENGINEERING, CHEMICAL","Score":null,"Total":0}
Citations: 0
Abstract
Accurate prediction of pure component physicochemical properties is crucial for process integration, multiscale modelling, and optimization. In this work, an enhanced framework for pure component property prediction using explainable machine learning methods is proposed. In this framework, a molecular representation method based on the connectivity matrix accounts for atomic bonding relationships and generates features automatically. The supervised machine learning model random forest is applied for feature ranking and pooling. The adjusted R² is introduced to penalize the inclusion of additional features, providing an assessment of the true contribution of features. The prediction results for normal boiling point (Tb), liquid molar volume (Lmv), critical temperature (Tc) and critical pressure (Pc) obtained using Artificial Neural Network and Gaussian Process Regression models confirm the accuracy of the molecular representation method. Comparison with group contribution (GC) based models shows that the root-mean-square error on the test set can be reduced by up to 83.8%. To enhance the interpretability of the model, a feature analysis method based on Shapley values is employed to determine the contribution of each feature to the property predictions. The results indicate that the feature pooling method reduces the number of features from 13316 to 100 without compromising model accuracy. The feature analysis results for Tb, Lmv, Tc, and Pc confirm that different molecular properties are influenced by different structural features, in line with mechanistic interpretations. In conclusion, the proposed framework is demonstrated to be feasible and provides a solid foundation for mixture component reconstruction and process integration modelling.
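The abstract does not state which form of the adjusted R² the authors use. As a minimal sketch, assuming the textbook definition 1 - (1 - R²)(n - 1)/(n - p - 1) for n samples and p features, the penalty on additional features can be illustrated as follows. The function names and the sample counts are hypothetical and are not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Ordinary coefficient of determination, R^2 = 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((yt - yp) ** 2 for yt, yp in zip(y_true, y_pred))
    ss_tot = sum((yt - mean_y) ** 2 for yt in y_true)
    return 1.0 - ss_res / ss_tot


def adjusted_r_squared(r2, n_samples, n_features):
    """Textbook adjusted R^2 (assumed form): 1 - (1 - R^2)(n - 1)/(n - p - 1)."""
    dof = n_samples - n_features - 1
    if dof <= 0:
        raise ValueError("adjusted R^2 needs n_samples > n_features + 1")
    return 1.0 - (1.0 - r2) * (n_samples - 1) / dof


# Hypothetical illustration: the same raw R^2 reached with a small and a
# large feature set on 500 molecules (numbers invented for this sketch).
small = adjusted_r_squared(0.95, n_samples=500, n_features=100)
large = adjusted_r_squared(0.95, n_samples=500, n_features=400)
print(f"100 features: {small:.4f}; 400 features: {large:.4f}")
```

With the raw R² held fixed, the larger feature set receives the lower adjusted score, which is the sense in which the metric penalizes features that do not improve the fit.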
About the journal:
The Chinese Journal of Chemical Engineering (monthly, founded in 1982) is the official journal of the Chemical Industry and Engineering Society of China and is published by the Chemical Industry Press Co. Ltd. The aim of the journal is to promote the international exchange of scientific and technical information in the field of chemical engineering. It publishes original research papers that cover major advances and achievements in chemical engineering in China, as well as some articles from overseas contributors.
The journal's topics include chemical engineering, chemical technology, biochemical engineering, energy and environmental engineering, and other relevant fields. Papers are published, on the basis of their relevance to theoretical research, practical application, or potential industrial use, as Research Papers, Communications, Reviews, and Perspectives. Prominent chemical engineering experts and scholars from China and overseas have been invited to form an International Advisory Board and the Editorial Committee. The journal is recognized among Chinese academia and industry as a reliable source of information on developments in chemical engineering research, both at home and abroad.