A molecular descriptor-based correlation with the composition of acid-pretreated cornstalk cultivation medium for biohydrogen production using a machine learning approach
Xiyue Zhang, Yixiao Wang, Jing Hu, Qingyue Zhang, Xiaoting Xuan, Lufang Shi, Yong Sun
International Journal of Hydrogen Energy, vol. 123. Published 2025-04-05. DOI: 10.1016/j.ijhydene.2025.03.400. Impact factor: 8.1 (JCR Q1, Chemistry, Physical). https://www.sciencedirect.com/science/article/pii/S0360319925015526
Citations: 0
Abstract
In this work, machine learning (ML) was used to examine the relationship between the physicochemical properties and concentration levels of 50 typical compounds derived from cornstalk acid hydrolysates during lignocellulosic pretreatment. These compounds, selected to represent the chemical matrix (with <32 % similarity), were analyzed using RDKit's MolecularDescriptorCalculator (MDC), which effectively reduced the number of extended-connectivity fingerprint (ECFP4) features from 366 chemical descriptors to 19 key descriptors. Notably, compounds such as glucose, fructose, furfural, lactic acid, acetate, formic acid, 4-hydroxy-3-methoxycinnamic acid, and citric acid exhibited consistent hierarchical clustering in cultivation media before (Con_int) and after (Con_aft) fermentation. The Gasteiger charge and LogP descriptors were effective in illustrating subtle differences among those compounds. In the regression model evaluation, the TensorFlow (TF) model demonstrated a stronger correlation (R² > 75 %) between chemical descriptors and pre-fermentation concentrations (Con_int) than with post-fermentation concentrations (Con_aft). SHapley Additive exPlanations (SHAP) analysis was applied to the TF model to interpret the chemical properties that influence compound levels in the fermentation cultivation medium, with LogP, Gasteiger charge, and aromatic ring counts being the most influential for Con_int, and Kappa1, radius of gyration, and hydrogen donors for Con_aft. A lignocellulosic acid hydrolysate compound library (LAHCL) was also constructed, based on the cheminformatics study, for future exploration of potential compounds during biohydrogen fermentation. This cheminformatics approach offers valuable insights into predicting, with reasonable accuracy, compound concentrations, biological activity, and the pool of relevant compounds for dark fermentation.
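Two quantities quoted in the abstract, the <32 % similarity cut-off and the R² > 75 % regression score, can be illustrated with a minimal Python sketch. The abstract does not name the similarity metric, so Tanimoto similarity over fingerprint bit sets is an assumption here. The compound names, bit sets, and concentration values are hypothetical toy data, not results from the paper.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto (Jaccard) similarity between two sets of 'on' fingerprint bits."""
    union = len(a | b)
    return len(a & b) / union if union else 0.0


def max_pairwise_similarity(fingerprints: dict) -> float:
    """Largest Tanimoto similarity over all distinct compound pairs."""
    return max(
        tanimoto(fingerprints[x], fingerprints[y])
        for x, y in combinations(fingerprints, 2)
    )


def r_squared(y_true, y_pred) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical toy fingerprints (indices of set bits), not real ECFP4 output.
toy_fps = {
    "compound_a": frozenset({1, 4, 9, 12}),
    "compound_b": frozenset({2, 4, 7, 15, 20}),
    "compound_c": frozenset({3, 9, 11, 18}),
}

# Hypothetical measured vs. predicted concentrations (arbitrary units).
toy_measured = [1.0, 2.0, 3.0, 4.0]
toy_predicted = [1.1, 1.9, 3.2, 3.8]

if __name__ == "__main__":
    # A set is "diverse" in the abstract's sense if every pair stays below 0.32.
    print("diverse set:", max_pairwise_similarity(toy_fps) < 0.32)
    # A fit meets the abstract's reported level if R^2 exceeds 0.75.
    print("R^2 above 0.75:", r_squared(toy_measured, toy_predicted) > 0.75)
```

In practice the bit sets would come from a cheminformatics toolkit such as RDKit rather than being written by hand; the pure-Python version only makes the arithmetic behind the two thresholds explicit.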
About the journal:
The objective of the International Journal of Hydrogen Energy is to facilitate the exchange of new ideas, technological advancements, and research findings in the field of Hydrogen Energy among scientists and engineers worldwide. This journal showcases original research, both analytical and experimental, covering various aspects of Hydrogen Energy. These include production, storage, transmission, utilization, enabling technologies, environmental impact, economic considerations, and global perspectives on hydrogen and its carriers such as NH3, CH4, alcohols, etc.
The utilization aspect encompasses various methods such as thermochemical (combustion), photochemical, electrochemical (fuel cells), and nuclear conversion of hydrogen, hydrogen isotopes, and hydrogen carriers into thermal, mechanical, and electrical energies. The applications of these energies can be found in transportation (including aerospace), industrial, commercial, and residential sectors.