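Mitigating dataset imbalance in molecular maximum absorption wavelength prediction via self-supervised learning and scaffold diversification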

IF 4.2 | Zone 3, Engineering and Technology | JCR Q2, CHEMISTRY, APPLIED
Yong Wang, Peifu Han, Xue Li, Shuang Wang, Xun Wang, Tao Song
Journal: Dyes and Pigments, Volume 246, Article 113287
Publication date: 2025-09-26
Publication type: Journal Article
DOI: 10.1016/j.dyepig.2025.113287
URL: https://www.sciencedirect.com/science/article/pii/S0143720825006576
Citations: 0

Mitigating dataset imbalance in molecular maximum absorption wavelength prediction via self-supervised learning and scaffold diversification

Near-infrared (NIR) absorbing dyes, especially those with maximum absorption wavelengths (λmax) in the NIR region, show promising potential for phototherapy and bioimaging. However, their limited representation in open-source datasets hinders the development of data-driven predictive models. To address this issue, the NIRExDs dataset was constructed by manually supplementing existing open-source datasets with 2805 entries extracted from the literature, resulting in 2.3-fold and 19.2-fold increases in the number of dyes in the NIR-I and NIR-II spectral regions, respectively. The self-supervised learning model Uni-Mol outperformed traditional supervised models in predicting λmax, particularly in the long-tailed and sparsely distributed NIR-II region, where the reduction in mean absolute error (MAE) was more pronounced than in other spectral regions. Analysis of the prediction results revealed that models trained on scaffold-split datasets can expose limitations related to insufficient molecular scaffold diversity. Inspired by this finding, an optimized dataset, NIRExDs-1, was developed by adding a small number of structurally similar compounds to the NIRExDs dataset. Models trained on NIRExDs-1 exhibited improved predictive performance on both internal and external test sets compared to those trained on the original NIRExDs dataset. Furthermore, scaffold split was significantly weaker than random split at capturing solvent effects, primarily because it cannot fully utilize measurement data of similar molecules across different solvent environments. Finally, Uni-Mol reproduced experimentally observed red-shift trends under classical molecular design strategies, confirming its ability to capture generalizable chemical rules.
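
The two evaluation ideas in the abstract, scaffold split and per-region MAE, can be illustrated with a minimal Python sketch. This is not the authors' code: the record fields, the precomputed `scaffold` labels, the function names, and the region boundaries (NIR-I as 700 to 1000 nm, NIR-II as 1000 to 1700 nm) are all assumptions made for illustration, since the abstract does not specify them.

```python
import random
from collections import defaultdict

# Assumed wavelength boundaries in nm (not given in the abstract).
REGIONS = {
    "below_NIR": (0.0, 700.0),
    "NIR-I": (700.0, 1000.0),
    "NIR-II": (1000.0, 1700.0),
}


def scaffold_split(records, test_fraction=0.2, seed=0):
    """Split records so that no scaffold appears in both train and test.

    Each record is a dict with a hypothetical precomputed "scaffold" key.
    Whole scaffold groups are assigned to the test set until it reaches
    roughly test_fraction of the data; the rest go to the training set.
    """
    groups = defaultdict(list)
    for rec in records:
        groups[rec["scaffold"]].append(rec)
    keys = sorted(groups)
    random.Random(seed).shuffle(keys)
    target = test_fraction * len(records)
    train, test = [], []
    for key in keys:
        if len(test) < target:
            test.extend(groups[key])
        else:
            train.extend(groups[key])
    return train, test


def mae_by_region(y_true, y_pred):
    """Mean absolute error grouped by the region of the true wavelength."""
    errors = defaultdict(list)
    for t, p in zip(y_true, y_pred):
        for name, (lo, hi) in REGIONS.items():
            if lo <= t < hi:
                errors[name].append(abs(t - p))
                break
    return {name: sum(v) / len(v) for name, v in errors.items()}


# Toy data: four made-up scaffolds, each "measured" in two solvents.
records = [
    {"scaffold": s, "solvent": solvent, "lambda_max": lam + shift}
    for s, lam in [("A", 650.0), ("B", 800.0), ("C", 1100.0), ("D", 1200.0)]
    for solvent, shift in [("toluene", 0.0), ("DMSO", 10.0)]
]
train, test = scaffold_split(records, test_fraction=0.25, seed=1)
region_mae = mae_by_region([650.0, 800.0, 1100.0, 1200.0],
                           [660.0, 790.0, 1150.0, 1250.0])
```

Because a scaffold split moves every measurement of a scaffold to the same side, the model never sees a test scaffold in another solvent during training, which is consistent with the abstract's observation that scaffold split captures solvent effects less well than random split. In practice the scaffold labels would come from a cheminformatics toolkit rather than being assigned by hand.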
Source journal
Dyes and Pigments (Engineering and Technology - Materials Science: Textiles)
CiteScore: 8.20
Self-citation rate: 13.30%
Articles published: 933
Review time: 33 days
Journal description: Dyes and Pigments covers the scientific and technical aspects of the chemistry and physics of dyes, pigments and their intermediates. Emphasis is placed on the properties of the colouring matters themselves rather than on their applications or the system in which they may be applied. Thus the journal accepts research and review papers on the synthesis of dyes, pigments and intermediates, their physical or chemical properties, e.g. spectroscopic, surface, solution or solid state characteristics, the physical aspects of their preparation, e.g. precipitation, nucleation and growth, crystal formation, liquid crystalline characteristics, their photochemical, ecological or biological properties and the relationship between colour and chemical constitution. However, papers are considered which deal with the more fundamental aspects of colourant application and of the interactions of colourants with substrates or media. The journal will interest a wide variety of workers in a range of disciplines whose work involves dyes, pigments and their intermediates, and provides a platform for investigators with common interests but diverse fields of activity such as cosmetics, reprographics, dye and pigment synthesis, medical research, polymers, etc.