SurfPro – a curated database and predictive model of experimental properties of surfactants†

Impact factor: 6.2 · JCR Q1 (Chemistry, Multidisciplinary)
Stefan L. Hödl, Luc Hermans, Pim F. J. Dankloff, Aigars Piruska, Wilhelm T. S. Huck and William E. Robinson
{"title":"SurfPro – a curated database and predictive model of experimental properties of surfactants†","authors":"Stefan L. Hödl, Luc Hermans, Pim F. J. Dankloff, Aigars Piruska, Wilhelm T. S. Huck and William E. Robinson","doi":"10.1039/D4DD00393D","DOIUrl":null,"url":null,"abstract":"<p >Despite great industrial interest, modeling the physical properties of surfactants in water based on their molecular structure remains a challenge. A significant part of this challenge is in obtaining sufficient amounts of high-quality data. Experimentally determined properties such the critical micelle concentration (CMC) and surface tension at CMC (<em>γ</em><small><sub>CMC</sub></small>) have been reported for many surfactants. However, surfactant data are scattered across many literature sources, and reported in a manner which is often unsuitable as input for predictive models. In this work, we address this limitation by compiling the SurfPro database of surfactant properties. SurfPro consists of 1624 surfactant entries curated from 223 literature sources, containing 1395 CMC values, 972 <em>γ</em><small><sub>CMC</sub></small> values and more than 657 values for <em>Γ</em><small><sub>max</sub></small>, <em>C</em><small><sub>20</sub></small>, π<small><sub>CMC</sub></small> and <em>A</em><small><sub>min</sub></small>. However, only 647 structures have all reported properties, and for most surfactants multiple properties are missing. We trained a previously reported graph neural network architecture for single- and multi-property prediction on these incomplete data of all surfactant types in the database to accurately predict pCMC (−log<small><sub>10</sub></small>(CMC)), <em>γ</em><small><sub>CMC</sub></small>, <em>Γ</em><small><sub>max</sub></small> and p<em>C</em><small><sub>20</sub></small>. 
We achieved state-of-the-art performance of these four properties using an ensemble of AttentiveFP models trained on ten different folds of the training data in the multi-property setting. Finally, we leveraged the predictions and uncertainties of the ensemble model to impute all missing properties for all 977 surfactants with an incomplete set of properties. We make our curated SurfPro database, proposed test split and training datasets, the imputed database, as well as our code publicly available.</p>","PeriodicalId":72816,"journal":{"name":"Digital discovery","volume":" 5","pages":" 1176-1187"},"PeriodicalIF":6.2000,"publicationDate":"2025-03-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://pubs.rsc.org/en/content/articlepdf/2025/dd/d4dd00393d?page=search","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Digital discovery","FirstCategoryId":"1085","ListUrlMain":"https://pubs.rsc.org/en/content/articlelanding/2025/dd/d4dd00393d","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}

Abstract

Despite great industrial interest, modeling the physical properties of surfactants in water based on their molecular structure remains a challenge. A significant part of this challenge is in obtaining sufficient amounts of high-quality data. Experimentally determined properties such as the critical micelle concentration (CMC) and surface tension at CMC (γCMC) have been reported for many surfactants. However, surfactant data are scattered across many literature sources and reported in a manner that is often unsuitable as input for predictive models. In this work, we address this limitation by compiling the SurfPro database of surfactant properties. SurfPro consists of 1624 surfactant entries curated from 223 literature sources, containing 1395 CMC values, 972 γCMC values and more than 657 values for Γmax, C20, πCMC and Amin. However, only 647 structures have all reported properties, and for most surfactants multiple properties are missing. We trained a previously reported graph neural network architecture for single- and multi-property prediction on these incomplete data of all surfactant types in the database to accurately predict pCMC (−log10(CMC)), γCMC, Γmax and pC20. We achieved state-of-the-art performance on these four properties using an ensemble of AttentiveFP models trained on ten different folds of the training data in the multi-property setting. Finally, we leveraged the predictions and uncertainties of the ensemble model to impute all missing properties for all 977 surfactants with an incomplete set of properties. We make our curated SurfPro database, proposed test split and training datasets, the imputed database, as well as our code publicly available.
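
The abstract defines pCMC as −log10(CMC) and describes imputing missing properties from the predictions and uncertainties of an ensemble. The Python sketch below illustrates those two ideas under stated assumptions; it is not the authors' code. The function names, the mol/L unit for CMC, and the choice of the mean and sample standard deviation across ensemble members as prediction and uncertainty are all assumptions made for illustration.

```python
import math
from statistics import mean, stdev

# Counts taken from the abstract: 1624 entries, 647 with all properties.
TOTAL_ENTRIES = 1624
COMPLETE_ENTRIES = 647
INCOMPLETE_ENTRIES = TOTAL_ENTRIES - COMPLETE_ENTRIES  # 977, as stated


def p_cmc(cmc_molar: float) -> float:
    """Return pCMC = -log10(CMC). Assumes CMC is given in mol/L."""
    if cmc_molar <= 0:
        raise ValueError("CMC must be positive")
    return -math.log10(cmc_molar)


def ensemble_impute(measured: dict, member_predictions: list) -> dict:
    """Fill missing (None) properties from an ensemble (hypothetical helper).

    measured: property name -> measured value, or None if missing.
    member_predictions: one dict per ensemble member, property -> prediction.
    Returns property -> (value, uncertainty). Measured values are kept as-is
    with uncertainty None; missing values get the ensemble mean and the
    sample standard deviation across members (an assumed uncertainty measure).
    """
    if len(member_predictions) < 2:
        raise ValueError("need at least two ensemble members for a std")
    result = {}
    for prop, value in measured.items():
        if value is not None:
            result[prop] = (value, None)
            continue
        preds = [member[prop] for member in member_predictions]
        result[prop] = (mean(preds), stdev(preds))
    return result


if __name__ == "__main__":
    # Made-up numbers, only to exercise the functions.
    measured = {"pCMC": p_cmc(1e-3), "gamma_CMC": None}
    members = [{"pCMC": 3.0, "gamma_CMC": 30.0 + i} for i in range(10)]
    print(INCOMPLETE_ENTRIES)
    print(ensemble_impute(measured, members))
```

The ten members in the usage example mirror the ten training folds mentioned in the abstract; the property values themselves are invented and carry no experimental meaning.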
