Authors: Juda Baikété, Alhadji Malloum, Jeanet Conradie
Journal: Artificial Intelligence Chemistry, vol. 3, no. 2, Article 100092
Published: 2025-08-08
DOI: 10.1016/j.aichem.2025.100092
URL: https://www.sciencedirect.com/science/article/pii/S2949747725000090
Machine learning prediction of pKa of organic acids
The logarithmic acid dissociation constant, pKa, quantifies the ionization of a chemical, which in turn affects lipophilicity, solubility, protein binding, and the ability to cross the plasma membrane. Ionization thereby influences absorption, distribution, metabolism, excretion, and toxicity. Accurate prediction of pKa values is therefore crucial for understanding and modulating the acidity and basicity of organic molecules, with applications in drug discovery, materials science, and environmental chemistry. Here, we present four tree-based machine learning models for pKa prediction of organic molecules. The four models, Random Forest (RF), Extra Trees (ExTr), Histogram Gradient Boosting (HGBoost), and Gradient Boosting (GBoost), were trained on an experimental pKa dataset and tested on two external datasets, SAMPL6 and SAMPL7. Structural and organic parameter (SPOC)-based descriptors were introduced to represent the physicochemical properties of the molecules. Further molecular descriptors were generated using density functional theory (DFT) calculations and the RDKit library. The ExTr model showed the best prediction performance, with an overall mean absolute error (MAE) of 1.41 pKa units. ExTr outperforms selected models on a range of benchmark data while offering two advantages: (1) full transparency (open descriptors and data), in contrast to proprietary black boxes, and (2) reduced computational cost compared to hybrid QM/ML approaches. Although specialized tools such as QupKake (MAE = 0.67) achieve better accuracy, our framework provides an open-source basis for interpretable pKa predictions that efficiently combines molecular physics and machine learning. The model offers an open, low-cost tool for pKa prediction across applications in chemistry and beyond.
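To make the reported error metric concrete, the following minimal sketch computes an MAE and then uses the Henderson-Hasselbalch relation to show how a pKa error of the reported size (1.41 units) shifts the predicted ionized fraction of a monoprotic acid at a fixed pH. It is an illustration only, not the authors' code: the function names, the pKa values, and the pH of 7.4 are assumptions chosen for the example, and none of the numbers come from the paper's dataset, SAMPL6, or SAMPL7.

```python
def mean_absolute_error(y_true, y_pred):
    """Average of |experimental - predicted| over all molecules."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def ionized_fraction_acid(pka, ph):
    """Fraction of a monoprotic acid HA present as A- at the given pH.

    Henderson-Hasselbalch: [A-]/[HA] = 10**(pH - pKa), so the
    deprotonated fraction is 1 / (1 + 10**(pKa - pH)).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


# Hypothetical pKa values, made up for illustration (not from the paper).
experimental = [4.0, 5.0, 9.0, 3.0]
predicted = [5.0, 4.0, 7.0, 5.0]

mae = mean_absolute_error(experimental, predicted)
print(f"MAE = {mae:.2f} pKa units")

# Effect of a 1.41-unit overestimate on the ionized fraction at an
# assumed pH of 7.4, for two hypothetical acids.
ASSUMED_PH = 7.4
ASSUMED_ERROR = 1.41
for true_pka in (4.4, 7.4):
    exact = ionized_fraction_acid(true_pka, ASSUMED_PH)
    shifted = ionized_fraction_acid(true_pka + ASSUMED_ERROR, ASSUMED_PH)
    print(f"pKa {true_pka}: ionized {exact:.3f} -> {shifted:.3f} with error")
```

The sensitivity depends strongly on how close the pKa lies to the pH of interest: an acid whose pKa is far below the pH stays almost fully ionized despite the error, whereas one whose pKa is near the pH changes its predicted ionization state substantially.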