Arkadiusz Leniak, Wojciech Pietruś*, and Rafał Kurczab*
Journal of Chemical Information and Modeling, vol. 65, no. 19, pp. 10323–10337. Published 2025-09-17. DOI: 10.1021/acs.jcim.5c01791 (https://pubs.acs.org/doi/10.1021/acs.jcim.5c01791)
From NMR to AI: Fusing 1H and 13C Representations for Enhanced QSPR Modeling
The ability to predict log D directly from spectral patterns marks a conceptual shift in cheminformatics. In this work, we demonstrate that 1H and 13C NMR spectra, computationally generated from molecular structures and transformed into machine learning-compatible vectors, can rival classical structure-based descriptors such as ECFP4 fingerprints in modeling log D. Through comprehensive benchmarking of nearly 70 models across seven algorithmic classes and three pH conditions, we show that concatenating the 1H and 13C NMR spectra offers the best trade-off between accuracy and efficiency. In the best case, a fused spectral CNN model achieved a root-mean-square error (RMSE) of 0.57 and a Q² of 0.76 using a 400-dimensional input vector, closely matching the ECFP4 benchmark (RMSE 0.56, Q² 0.78) despite its input being five times smaller. These findings challenge the assumption that descriptor richness must come at the cost of dimensional complexity. SHAP-based analysis revealed modality-specific patterns: 13C regions linked to aromatic and carbonyl carbons (110–170 ppm) increased predicted log D, while 1H signals associated with polar groups, including OH, NH, amides, and ethers (2–4.5 and ∼8 ppm), reduced it. This positions NMR-based vectors as both interpretable and scalable alternatives to conventional fingerprints. By releasing a standalone graphical prediction tool based on our models, we make this paradigm practically accessible for real-world applications. This study establishes in silico-generated NMR spectra as valid and powerful descriptors in predictive modeling, paving the way for spectrum-driven approaches to drug discovery and property prediction.
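The fusion step described in the abstract can be illustrated with a minimal sketch. It is not the authors' pipeline: the abstract does not say how spectra are vectorized, so the equal-width binning, the ppm windows, the 200 + 200 split of the 400-dimensional vector, and all function names below are assumptions chosen for illustration. The Q² function uses one common definition (1 minus the ratio of the prediction error sum of squares to the total sum of squares), which may differ from the one used in the paper.

```python
import math

# Assumed layout (not stated in the abstract): 200 bins per nucleus, so the
# fused vector has 400 entries, the dimensionality quoted in the abstract.
H1_RANGE = (0.0, 12.0)    # illustrative ppm window for 1H
C13_RANGE = (0.0, 220.0)  # illustrative ppm window for 13C
N_BINS = 200


def bin_shifts(shifts_ppm, ppm_range, n_bins=N_BINS):
    """Count predicted peaks in equal-width ppm bins."""
    lo, hi = ppm_range
    width = (hi - lo) / n_bins
    vector = [0.0] * n_bins
    for shift in shifts_ppm:
        if lo <= shift <= hi:
            # Clamp so a shift exactly at the upper edge lands in the last bin.
            index = min(int((shift - lo) / width), n_bins - 1)
            vector[index] += 1.0
    return vector


def fuse_spectra(h1_shifts, c13_shifts):
    """Concatenate the 1H and 13C vectors into a single descriptor."""
    return bin_shifts(h1_shifts, H1_RANGE) + bin_shifts(c13_shifts, C13_RANGE)


def rmse(y_true, y_pred):
    """Root-mean-square error between observed and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def q2(y_true, y_pred):
    """1 - PRESS / total sum of squares, computed on held-out predictions."""
    mean = sum(y_true) / len(y_true)
    press = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    total = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - press / total


# Hypothetical chemical shifts for a small molecule (illustrative values only).
fused = fuse_spectra([1.25, 2.6, 3.7], [18.0, 58.0])
print(len(fused))  # 400
```

Because the two blocks occupy fixed positions in the fused vector, a feature-attribution method such as SHAP can report contributions per ppm region and per nucleus, which is the kind of modality-specific reading the abstract describes.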
About the journal:
The Journal of Chemical Information and Modeling publishes papers reporting new methodology and/or important applications in the fields of chemical informatics and molecular modeling. Specific topics include the representation and computer-based searching of chemical databases, molecular modeling, computer-aided molecular design of new materials, catalysts, or ligands, development of new computational methods or efficient algorithms for chemical software, and biopharmaceutical chemistry including analyses of biological activity and other issues related to drug discovery.
Astute chemists, computer scientists, and information specialists look to this monthly’s insightful research studies, programming innovations, and software reviews to keep current with advances in this integral, multidisciplinary field.
As a subscriber you’ll stay abreast of database search systems, use of graph theory in chemical problems, substructure search systems, pattern recognition and clustering, analysis of chemical and physical data, molecular modeling, graphics and natural language interfaces, bibliometric and citation analysis, and synthesis design and reactions databases.