Yong Wang, Peifu Han, Xue Li, Shuang Wang, Xun Wang, Tao Song
Dyes and Pigments, Volume 246, Article 113287. Published 2025-09-26. DOI: 10.1016/j.dyepig.2025.113287
https://www.sciencedirect.com/science/article/pii/S0143720825006576
Mitigating dataset imbalance in molecular maximum absorption wavelength prediction via self-supervised learning and scaffold diversification
Near-infrared (NIR) absorbing dyes, especially those with maximum absorption wavelengths (λmax) in the NIR region, show promising potential for phototherapy and bioimaging. However, their limited representation in open-source datasets hinders the development of data-driven predictive models. To address this issue, the NIRExDs dataset was constructed by manually supplementing existing open-source datasets with 2805 entries extracted from the literature, resulting in 2.3-fold and 19.2-fold increases in the number of dyes in the NIR-I and NIR-II spectral regions, respectively. The self-supervised learning model Uni-Mol outperformed traditional supervised models in predicting λmax, particularly in the long-tailed and sparsely distributed NIR-II region, where the reduction in mean absolute error (MAE) was more pronounced than in other spectral regions. Analysis of the prediction results revealed that training on scaffold-split datasets can expose limitations caused by insufficient molecular scaffold diversity. Inspired by this finding, an optimized dataset, NIRExDs-1, was developed by adding a small number of structurally similar compounds to the NIRExDs dataset. Models trained on NIRExDs-1 exhibited improved predictive performance on both internal and external test sets compared to those trained on the original NIRExDs dataset. Furthermore, the scaffold split was significantly weaker than the random split at capturing solvent effects, primarily because it cannot fully utilize measurement data for similar molecules across different solvent environments. Finally, Uni-Mol reproduced experimentally observed red-shift trends under classical molecular design strategies, confirming its ability to capture generalizable chemical rules.
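The contrast between scaffold split and random split mentioned in the abstract can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the function names, the record layout, and the placeholder scaffold keys ("A" to "E") are all assumptions made for illustration, and in practice the scaffold key would come from a cheminformatics toolkit rather than being supplied by hand. The sketch only shows the structural property at stake: a scaffold split keeps every measurement of a scaffold on one side, so the training set holds no data on test scaffolds in other solvents, whereas a random split gives no such guarantee.

```python
import random
from collections import defaultdict


def scaffold_split(records, test_fraction=0.2):
    """Group-aware split: records sharing a scaffold key stay on the same side."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[rec["scaffold"]].append(idx)
    # Fill the training set with the largest scaffold groups first;
    # groups that no longer fit go to the test set.
    ordered = sorted(groups.values(), key=len, reverse=True)
    n_train_target = len(records) - round(len(records) * test_fraction)
    train, test = [], []
    for members in ordered:
        if len(train) + len(members) <= n_train_target:
            train.extend(members)
        else:
            test.extend(members)
    return train, test


def random_split(records, test_fraction=0.2, seed=0):
    """Plain random split that ignores scaffold membership."""
    indices = list(range(len(records)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(records) * test_fraction)
    return indices[n_test:], indices[:n_test]


def shared_scaffolds(records, train, test):
    """Scaffold keys that appear on both sides of a split."""
    in_train = {records[i]["scaffold"] for i in train}
    in_test = {records[i]["scaffold"] for i in test}
    return in_train & in_test


# Toy records (hypothetical): five placeholder scaffolds, each "measured" in two solvents.
records = [
    {"scaffold": s, "solvent": v}
    for s in ("A", "B", "C", "D", "E")
    for v in ("toluene", "ethanol")
]

s_train, s_test = scaffold_split(records)
r_train, r_test = random_split(records)
print("scaffold split overlap:", shared_scaffolds(records, s_train, s_test))
print("random split overlap:", shared_scaffolds(records, r_train, r_test))
```

Under the scaffold split the overlap is empty by construction, which is what makes it a stricter test of generalization to unseen scaffolds and also why solvent information about test molecules is unavailable during training.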
About the journal:
Dyes and Pigments covers the scientific and technical aspects of the chemistry and physics of dyes, pigments and their intermediates. Emphasis is placed on the properties of the colouring matters themselves rather than on their applications or the system in which they may be applied.
Thus the journal accepts research and review papers on the synthesis of dyes, pigments and intermediates, their physical or chemical properties, e.g. spectroscopic, surface, solution or solid state characteristics, the physical aspects of their preparation, e.g. precipitation, nucleation and growth, crystal formation, liquid crystalline characteristics, their photochemical, ecological or biological properties and the relationship between colour and chemical constitution. However, papers that deal with the more fundamental aspects of colourant application and of the interactions of colourants with substrates or media are also considered.
The journal will interest a wide variety of workers in a range of disciplines whose work involves dyes, pigments and their intermediates, and provides a platform for investigators with common interests but diverse fields of activity such as cosmetics, reprographics, dye and pigment synthesis, medical research, polymers, etc.