Enhancing predictive models for solubility in multicomponent solvent systems using semi-supervised graph neural networks†

Impact factor: 6.2; JCR Q1 (Chemistry, Multidisciplinary)
Hojin Jung, Christopher D. Stubbs, Sabari Kumar, Raúl Pérez-Soto, Su-min Song, Yeonjoon Kim and Seonah Kim
Journal: Digital Discovery, volume 6, pages 1492-1504
Published: 2025-05-02
DOI: 10.1039/D5DD00015G
Article page: https://pubs.rsc.org/en/content/articlelanding/2025/dd/d5dd00015g

Abstract

Solubility plays a critical role in guiding molecular design, reaction optimization, and product formulation across diverse chemical applications. Despite its importance, current approaches for measuring solubility face significant challenges, including time- and resource-intensive experiments and limited applicability to novel compounds. Computational prediction strategies, ranging from theoretical models to machine learning (ML) based methods, offer promising pathways to address these challenges. However, such methodologies need further improvement to achieve accurate predictions of solubilities in multicomponent solvent systems, as expanding the modeling approaches to multicomponent mixtures enables broader practical applications in chemistry. This study focuses on modeling solubility in multicomponent solvent systems, where data scarcity and model generalizability remain key hurdles. We curated a comprehensive experimental solubility dataset (MixSolDB) and examined two graph neural network (GNN) architectures – concatenation and subgraph – for improved predictive performance. By further integrating computationally derived COSMO-RS data via a teacher–student semi-supervised distillation (SSD) framework, we significantly expanded the chemical space and corrected previously high error margins. These results illustrate the feasibility of unifying experimental and computational data in a robust, flexible GNN-SSD pipeline, enabling greater coverage, improved accuracy, and enhanced applicability of solubility models for complex multicomponent solvent systems.
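The abstract names a teacher–student semi-supervised distillation (SSD) scheme but does not describe its steps. The following is a minimal sketch of one plausible reading, not the authors' implementation: a teacher is fitted on a small experimental set, computed labels (standing in for COSMO-RS values) are kept only where the teacher agrees within a tolerance, and a student is refitted on the combined data with the computed points down-weighted. A one-dimensional linear model replaces the GNN to keep the example self-contained; the function names, the agreement filter, `tol`, `pseudo_weight`, and all numbers are illustrative assumptions.

```python
def fit_line(xs, ys, weights=None):
    """Weighted least-squares fit of y = a*x + b; returns (a, b)."""
    if weights is None:
        weights = [1.0] * len(xs)
    sw = sum(weights)
    mx = sum(w * x for w, x in zip(weights, xs)) / sw
    my = sum(w * y for w, y in zip(weights, ys)) / sw
    sxx = sum(w * (x - mx) ** 2 for w, x in zip(weights, xs))
    sxy = sum(w * (x - mx) * (y - my) for w, x, y in zip(weights, xs, ys))
    a = sxy / sxx
    return a, my - a * mx


def predict(model, xs):
    """Evaluate a fitted (a, b) line at each x."""
    a, b = model
    return [a * x + b for x in xs]


def distill(x_lab, y_lab, x_comp, y_comp, tol=0.5, pseudo_weight=0.3):
    """Teacher-student loop with an agreement filter (hypothetical rule).

    x_lab, y_lab : experimental inputs and labels (toy stand-ins)
    x_comp, y_comp : inputs with computed labels (toy stand-ins)
    Returns (teacher, student, number_of_computed_points_kept).
    """
    # 1. Teacher: fitted on experimental data only.
    teacher = fit_line(x_lab, y_lab)
    # 2. Keep a computed label only if the teacher agrees within tol.
    y_teacher = predict(teacher, x_comp)
    kept = [(x, yc) for x, yt, yc in zip(x_comp, y_teacher, y_comp)
            if abs(yt - yc) <= tol]
    # 3. Student: refitted on experimental plus kept computed points,
    #    with the computed points given a smaller weight.
    xs = list(x_lab) + [x for x, _ in kept]
    ys = list(y_lab) + [y for _, y in kept]
    ws = [1.0] * len(x_lab) + [pseudo_weight] * len(kept)
    student = fit_line(xs, ys, ws)
    return teacher, student, len(kept)


# Toy data: the second computed label (9.0) disagrees with the teacher
# and is filtered out; the other two extend the covered input range.
x_lab, y_lab = [0.0, 1.0, 2.0, 3.0], [0.1, 0.9, 2.1, 2.9]
x_comp, y_comp = [4.0, 5.0, 6.0], [4.1, 9.0, 5.9]
teacher, student, n_kept = distill(x_lab, y_lab, x_comp, y_comp)
print("teacher:", teacher, "student:", student, "kept:", n_kept)
```

In this toy setup the filter is what keeps a badly wrong computed label from pulling the student away from the experimental trend, while the accepted points extend the range the student has seen.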

