Machine learning prediction of pKa of organic acids

Artificial intelligence chemistry, Pub Date: 2025-08-08, DOI: 10.1016/j.aichem.2025.100092

Juda Baikété, Alhadji Malloum, Jeanet Conradie


Citations: 0

Abstract



The logarithmic acid dissociation constant, pKa, reflects the ionization state of a chemical, which affects its lipophilicity, solubility, protein binding, and ability to cross the plasma membrane. It therefore influences absorption, distribution, metabolism, excretion, and toxicity (ADMET) properties. Accurate prediction of pKa values is thus crucial for understanding and modulating the acidity and basicity of organic molecules, with applications in drug discovery, materials science, and environmental chemistry. Here, we present four tree-based machine learning models for pKa prediction of organic molecules. The four models, Random Forest (RF), Extra Trees (ExTr), Histogram Gradient Boosting (HGBoost), and Gradient Boosting (GBoost), were trained on an experimental pKa dataset and tested on two external datasets, SAMPL6 and SAMPL7. Structural and organic parameter (SPOC)-based descriptors were introduced to represent the physicochemical properties of the molecules. Further molecular descriptors were generated using density functional theory (DFT) calculations and the RDKit library. The model trained with the ExTr algorithm showed the best prediction performance, with an overall mean absolute error (MAE) of 1.41 pKa units. Our ExTr model outperforms selected models on a range of benchmark data while offering two advantages: (1) full transparency (open descriptors and data), in contrast to proprietary black boxes, and (2) reduced computational cost compared to hybrid QM/ML approaches. While specialized tools such as QupKake (MAE = 0.67) achieve better accuracy, our framework provides an open-source basis for interpretable pKa predictions, efficiently combining molecular physics and machine learning. This model represents a significant advancement in pKa prediction, offering a powerful tool for various applications in chemistry and beyond.
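To make the quantities in the abstract concrete, the following minimal sketch shows how a pKa value maps to an ionized fraction through the Henderson-Hasselbalch relation, and how a mean absolute error is computed from paired experimental and predicted values. It is an illustration only, not the authors' code: the function names, the monoprotic-acid assumption, the pH of 7.4, and the four value pairs are all hypothetical and are not taken from the paper's dataset.

```python
import math


def fraction_ionized_acid(pka: float, ph: float) -> float:
    """Fraction of a monoprotic acid HA present as A- at a given pH.

    Follows from the Henderson-Hasselbalch relation
    pH = pKa + log10([A-]/[HA]).
    """
    return 1.0 / (1.0 + 10.0 ** (pka - ph))


def mean_absolute_error(y_true, y_pred):
    """Average of |true - predicted| over paired values."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical experimental and predicted pKa values (illustrative only,
# not data from the paper).
experimental = [4.76, 3.75, 9.99, 4.20]
predicted = [5.10, 3.20, 8.90, 4.60]
mae = mean_absolute_error(experimental, predicted)

# Effect of a pKa error on the predicted ionization at an assumed pH of 7.4.
ph = 7.4
true_fraction = fraction_ionized_acid(7.4, ph)            # pKa equal to pH
shifted_fraction = fraction_ionized_acid(7.4 + 1.41, ph)  # pKa off by 1.41
ratio_shift = 10.0 ** 1.41  # factor by which [A-]/[HA] changes

print(f"MAE = {mae:.3f} pKa units")
print(f"ionized fraction: {true_fraction:.3f} vs {shifted_fraction:.3f}")
print(f"[A-]/[HA] ratio changes by a factor of about {ratio_shift:.1f}")
```

Because pKa is a logarithmic quantity, an error of one pKa unit corresponds to a tenfold change in the ratio of ionized to neutral species, which is why MAE values are reported in pKa units and why differences such as 1.41 versus 0.67 matter for downstream property estimates.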


Source journal

Artificial intelligence chemistry Chemistry (General)

Self-citation rate

0.00%

Publication volume

Review time

21 days