The openOCHEM consensus model is the best-performing open-source predictive model in the First EUOS/SLAS joint compound solubility challenge

Impact factor: 4.6 · JCR Q2 · Materials Science, Biomaterials
Andrea Hunklinger, Peter Hartog, Martin Šícho, Guillaume Godin, Igor V. Tetko
DOI: 10.1016/j.slasd.2024.01.005
Published: 2024-02-03

Abstract

The EUOS/SLAS challenge aimed to facilitate the development of reliable algorithms for predicting the aqueous solubility of small molecules using experimental data from 100K compounds. In total, one hundred teams took part in the challenge to predict compounds of low, medium and high solubility as measured by a nephelometry assay. This article describes the winning model, which was developed using the publicly available Online CHEmical database and Modeling environment (OCHEM), available at https://ochem.eu/article/27. We describe in detail the assumptions and steps used to select the methods, descriptors and strategy that contributed to the winning solution. In particular, we show that a consensus of 28 models calculated using descriptor-based and representation-learning methods achieved the best score, higher than the scores of individual approaches or of consensus models built within each individual approach. Combining diverse models decreased both the bias and the variance of the individual models and yielded the highest score. The model based on Transformer CNN contributed the best individual score, highlighting the power of Natural Language Processing (NLP) methods. Including information about aleatoric uncertainty would be important for contestants to better understand and use the challenge data.
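The consensus idea in the abstract can be illustrated with a minimal sketch. It assumes that each model outputs a probability for each of the three solubility classes and that the consensus is a plain unweighted average. The abstract does not state how the 28 models were actually combined, so the averaging rule, the function names and the toy numbers below are all hypothetical.

```python
from statistics import mean

# Class order is an assumption made for this illustration only.
CLASSES = ("low", "medium", "high")


def consensus_probabilities(model_outputs):
    """Average per-class probabilities across models for every compound.

    model_outputs: one entry per model; each entry is a list with one
    (p_low, p_medium, p_high) tuple per compound.
    """
    n_compounds = len(model_outputs[0])
    averaged = []
    for i in range(n_compounds):
        per_class = [
            mean(model[i][c] for model in model_outputs)
            for c in range(len(CLASSES))
        ]
        averaged.append(per_class)
    return averaged


def consensus_labels(model_outputs):
    """Pick the class with the highest averaged probability."""
    labels = []
    for probs in consensus_probabilities(model_outputs):
        best = max(range(len(CLASSES)), key=probs.__getitem__)
        labels.append(CLASSES[best])
    return labels


# Toy predictions for two compounds from three hypothetical models.
model_a = [(0.7, 0.2, 0.1), (0.1, 0.3, 0.6)]
model_b = [(0.4, 0.5, 0.1), (0.2, 0.2, 0.6)]  # alone: "medium" for compound 0
model_c = [(0.6, 0.3, 0.1), (0.1, 0.5, 0.4)]  # alone: "medium" for compound 1

labels = consensus_labels([model_a, model_b, model_c])
print(labels)
```

In this toy case each of `model_b` and `model_c` disagrees with the other two models on one compound, and averaging outvotes the outlier. That is the intuition behind combining diverse models to reduce the variance of individual predictions.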

Source journal: ACS Applied Bio Materials (Chemistry: Chemistry (all))
CiteScore: 9.40
Self-citation rate: 2.10%
Articles published: 464