Evaluation of chirality descriptors derived from SMILES heteroencoders
Authors: Natalia Baimacheva, Xinyue Gao, Joao Aires-de-Sousa
Journal: Journal of Cheminformatics, 17 (1), published 2025-08-31
DOI: 10.1186/s13321-025-01080-7
Article: https://link.springer.com/article/10.1186/s13321-025-01080-7
Abstract
Molecular representations of chirality, derived from latent space vectors (LSVs) of SMILES heteroencoders, were explored to train machine learning models to predict chiral properties, and were compared to conventional circular fingerprints. Latent space arithmetic was applied to enhance the representation of chirality, by calculating the difference between the original descriptor of a molecule and the descriptor of its enantiomer, or the difference between the original descriptor and the descriptor obtained with the stereochemistry-depleted SMILES string. Machine learning was performed with the Random Forest algorithm applied to a dataset of 3858 molecules extracted from the literature (1929 pairs of enantiomers) to predict the elution order observed on the Chiralpak® AD-H column, as well as intrinsic structural chirality labels (R/S or canonical SMILES @/@@). The descriptors derived from the heteroencoders achieved an accuracy of up to 0.75 in the prediction of the elution order, while the circular fingerprints performed better (0.82). The difference LSV descriptors showed better predictive ability than the original descriptors.
Our work proposes latent space arithmetic to obtain descriptors of molecular chirality from SMILES heteroencoders. We used these molecular representations to establish quantitative structure-enantioselectivity relationships for predicting the elution order of enantiomers in chiral chromatography, and compared the results with those obtained with circular fingerprints. The study shows that delta descriptors relative to the opposite enantiomer enhance the ability of latent space vectors to encode chirality.
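The latent space arithmetic described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `toy_encode` is a hypothetical stand-in for a trained SMILES heteroencoder (it only hashes character bigrams into a fixed-length vector), and the helpers `enantiomer_smiles`, `strip_stereo`, `enantiomer_delta` and `achiral_delta` are illustrative names. The string-level handling of stereo marks assumes only `@`/`@@` tetrahedral centres and `/`, `\` bond marks are present.

```python
import re
import zlib


def toy_encode(smiles, dim=16):
    """Hypothetical placeholder for a heteroencoder's latent space vector.

    Hashes character bigrams into `dim` bins; a real heteroencoder would
    return the latent vector of a trained neural network instead.
    """
    vec = [0.0] * dim
    for i in range(len(smiles) - 1):
        bigram = smiles[i:i + 2]
        vec[zlib.crc32(bigram.encode()) % dim] += 1.0
    return vec


def enantiomer_smiles(smiles):
    """Swap every @ with @@ and vice versa (assumes tetrahedral centres only)."""
    return re.sub(r"@@?", lambda m: "@" if m.group() == "@@" else "@@", smiles)


def strip_stereo(smiles):
    """Remove the stereo marks @, @@, / and \\ to get a stereo-depleted string."""
    return re.sub(r"[@/\\]", "", smiles)


def delta(a, b):
    """Element-wise difference of two equal-length vectors."""
    return [x - y for x, y in zip(a, b)]


def enantiomer_delta(smiles, encode=toy_encode):
    """Descriptor of the molecule minus descriptor of its enantiomer."""
    return delta(encode(smiles), encode(enantiomer_smiles(smiles)))


def achiral_delta(smiles, encode=toy_encode):
    """Descriptor of the molecule minus descriptor of its stereo-depleted form."""
    return delta(encode(smiles), encode(strip_stereo(smiles)))


# The two enantiomers of alanine, written as isomeric SMILES.
ala_a = "C[C@H](N)C(=O)O"
ala_b = enantiomer_smiles(ala_a)

print(ala_b)
print(enantiomer_delta(ala_a))
print(achiral_delta(ala_a))
```

By construction, the enantiomer difference descriptor changes sign when the molecule is replaced by its mirror image, which is one way to see why it isolates stereochemical information. In a workflow like the one the abstract describes, such vectors would then be passed to a classifier such as Random Forest, which is not shown here.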
About the journal:
Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling.
Coverage includes, but is not limited to:
chemical information systems, software and databases, and molecular modelling,
chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases,
computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.