{"title":"Predicting inhibitors of OATP1B1 via heterogeneous OATP-ligand interaction graph neural network (HOLIgraph)","authors":"Mehrsa Mardikoraem, Joelle N. Eaves, Theodore Belecciu, Nathaniel Pascual, Alexander Aljets, Bruno Hagenbuch, Erik M. Shapiro, Benjamin J. Orlando, Daniel R. Woldring","doi":"10.1186/s13321-025-01020-5","DOIUrl":"10.1186/s13321-025-01020-5","url":null,"abstract":"<div><p>Organic anion transporting polypeptides (OATPs) are membrane transporters crucial for drug uptake and distribution in the human body. OATPs can mediate drug-drug interactions (DDIs) in which the interaction of one drug with an OATP impairs the uptake of another drug, resulting in potentially fatal pharmacological effects. Predicting OATP-mediated DDIs is challenging, due to limited information on OATP inhibition mechanisms and inconsistent experimental OATP inhibition data across different studies. This study introduces Heterogeneous OATP-Ligand Interaction Graph Neural Network (HOLIgraph), a novel computational model that integrates molecular modeling with a graph neural network to enhance the prediction of drug-induced OATP inhibition. By combining ligand (i.e., drug) molecular features with protein-ligand interaction data from rigorous docking simulations, HOLIgraph outperforms traditional DDI prediction models which rely solely on ligand molecular features. HOLIgraph achieved a median balanced accuracy of over 90 percent when predicting inhibitors for OATP1B1, significantly outperforming purely ligand-based models. Beyond improving inhibition prediction, the data used to train HOLIgraph can enable the characterization of protein residues involved in inhibitory drug-OATP interactions. We identified certain OATP1B1 residues that preferentially interact with inhibitors, including I46 and K49. 
We anticipate such interaction information will be valuable to future structural and mechanistic investigations of OATP1B1.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01020-5","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143908853","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Application of 3D atom pair map in an attention model for enhanced drug virtual screening","authors":"Gina Ryu, Wankyu Kim","doi":"10.1186/s13321-025-01023-2","DOIUrl":"10.1186/s13321-025-01023-2","url":null,"abstract":"<p>This study demonstrates the utility of a novel molecular representation, 3D APM and a deep learning model based on it for virtual screening, suggesting that many other prediction models would also benefit from adopting APM. An open-source script to generate 3D APM is available at https://github.com/rimeless/APM</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01023-2","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143908751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Prediction of blood–brain barrier and Caco-2 permeability through the Enalos Cloud Platform: combining contrastive learning and atom-attention message passing neural networks","authors":"Nikoletta-Maria Koutroumpa, Andreas Tsoumanis, Haralambos Sarimveis, Iseult Lynch, Georgia Melagraki, Antreas Afantitis","doi":"10.1186/s13321-025-01007-2","DOIUrl":"10.1186/s13321-025-01007-2","url":null,"abstract":"<div><p>In this study, we introduce a novel approach for predicting two key drug properties, blood–brain barrier (BBB) permeability and human intestinal absorption via Caco-2 permeability. Our methodology centers around a specialized neural network, the atom transformer-based Message Passing Neural Network (MPNN), which we have combined with contrastive learning techniques to enhance the process of representing and embedding molecular structures for more accurate property prediction. These innovative models focus on predicting BBB and Caco-2 permeability, two critical factors in drug absorption and distribution, which fall under the broader scope of ADMET (absorption, distribution, metabolism, excretion, and toxicity) properties. The models are readily accessible online through the Enalos Cloud Platform, which offers a user-friendly, AI-powered, ready-to-use web service that significantly streamlines the drug design process, enabling users to easily predict and understand the behavior of potential drug compounds within the human body.</p><p><b>Scientific Contribution</b> Our study combines an atom-attention Message Passing Neural Network (AA-MPNN) with contrastive learning (CL), which significantly improves predictive accuracy. Our model leverages self-supervised learning to expand the chemical space used in training and self-attention mechanisms to focus on critical molecular features, enhancing both model accuracy and interpretability. 
Additionally, the ready-to-use web service based on our model democratizes access to predictive tools for the scientific and regulatory communities.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01007-2","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143904854","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Leveraging AI to explore structural contexts of post-translational modifications in drug binding","authors":"Kirill E. Medvedev, R. Dustin Schaeffer, Nick V. Grishin","doi":"10.1186/s13321-025-01019-y","DOIUrl":"10.1186/s13321-025-01019-y","url":null,"abstract":"<div><p>Post-translational modifications (PTMs) play a crucial role in allowing cells to expand the functionality of their proteins and adaptively regulate their signaling pathways. Defects in PTMs have been linked to numerous developmental disorders and human diseases, including cancer, diabetes, heart, neurodegenerative and metabolic diseases. PTMs are important targets in drug discovery, as they can significantly influence various aspects of drug interactions including binding affinity. The structural consequences of PTMs, such as phosphorylation-induced conformational changes or their effects on ligand binding affinity, have historically been challenging to study on a large scale, primarily due to reliance on experimental methods. Recent advancements in computational power and artificial intelligence, particularly in deep learning algorithms and protein structure prediction tools like AlphaFold3, have opened new possibilities for exploring the structural context of interactions between PTMs and drugs. These AI-driven methods enable accurate modeling of protein structures including prediction of PTM-modified regions and simulation of ligand-binding dynamics on a large scale. In this work, we identified small molecule binding-associated PTMs that can influence drug binding across all human proteins listed as small molecule targets in the DrugDomain database, which we developed recently. 
The 6,131 identified PTMs were mapped to structural domains from the Evolutionary Classification of Protein Domains (ECOD) database.</p><p><b>Scientific contribution</b>: Using recent AI-based approaches for protein structure prediction (AlphaFold3, RoseTTAFold All-Atom, Chai-1), we generated 14,178 models of PTM-modified human proteins with docked ligands. Our results demonstrate that these methods can predict PTM effects on small molecule binding, but precise evaluation of their accuracy requires a much larger benchmarking set. We also found that phosphorylation of NADPH-Cytochrome P450 Reductase, observed in cervical and lung cancer, causes significant structural disruption in the binding pocket, potentially impairing protein function. All data and generated models are available from DrugDomain database v1.1 (http://prodata.swmed.edu/DrugDomain/) and GitHub (https://github.com/kirmedvedev/DrugDomain). To our knowledge, this resource is the first to offer structural context for small molecule binding-associated PTMs on a large scale.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-05-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01019-y","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143904752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving the accuracy of prediction models for small datasets of Cytochrome P450 inhibition with deep learning","authors":"Elpri Eka Permadi, Reiko Watanabe, Kenji Mizuguchi","doi":"10.1186/s13321-025-01015-2","DOIUrl":"10.1186/s13321-025-01015-2","url":null,"abstract":"<div><p>The cytochrome P450 (CYP) superfamily metabolises a wide range of compounds; however, drug-induced CYP inhibition can lead to adverse interactions. Identifying potential CYP inhibitors is crucial for safe drug administration. This study investigated the application of deep learning techniques to the prediction of CYP inhibition, focusing on the challenges posed by limited datasets for CYP2B6 and CYP2C8 isoforms. To tackle these limitations, we leveraged larger datasets for related CYP isoforms, compiling comprehensive data from public databases containing IC50 values for 12,369 compounds that target seven CYP isoforms. We constructed single-task, fine-tuning, multitask, and multitask models incorporating data imputation on the missing values. Notably, the multitask models with data imputation demonstrated significant improvement in CYP inhibition prediction over the single-task models. Using the most accurate prediction models, we evaluated the inhibitory activity of approved drugs against CYP2B6 and CYP2C8. Among the 1,808 approved drugs analysed, our multitask models with data imputation identified 161 and 154 potential inhibitors of CYP2B6 and CYP2C8, respectively. This study underscores the significant potential of multitask deep learning, particularly when utilising a graph convolutional network with data imputation, to enhance the accuracy of CYP inhibition predictions under the conditions of limited data availability.</p><p><b>Scientific contribution</b></p><p>This study demonstrates that even with small datasets, accurate prediction models can be constructed by utilising related data effectively. 
Also, our imputation techniques on the missing values improved the prediction accuracy of CYP2B6 and CYP2C8 inhibition significantly.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-04-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01015-2","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143888761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LAGNet: better electron density prediction for LCAO-based data and drug-like substances","authors":"Konstantin Ushenin, Kuzma Khrabrov, Artem Tsypin, Anton Ber, Egor Rumiantsev, Artur Kadurin","doi":"10.1186/s13321-025-01010-7","DOIUrl":"10.1186/s13321-025-01010-7","url":null,"abstract":"<div><p>The electron density is an important object in quantum chemistry that is crucial for many downstream tasks in drug design. Recent deep learning approaches predict the electron density around a molecule from atom types and atom positions. Most of these methods use the plane wave (PW) numerical method as a source of ground-truth training data. However, the drug design field mostly uses the Linear Combination of Atomic Orbitals (LCAO) for computation of quantum properties. In this study, we focus on prediction of the electron density for drug-like substances and training neural networks with LCAO-based datasets. Our experiments show that proper handling of large amplitudes of core orbitals is crucial for training on LCAO-based data. We propose to store the electron density with the standard grids instead of the uniform grid. This allowed us to reduce the number of probing points per molecule by 43 times and reduce storage space requirements by 8 times. Finally, we propose a novel architecture based on the DeepDFT model that we name LAGNet. 
It is specifically designed and tuned for drug-like substances and the ∇²DFT dataset.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01010-7","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"E-GuARD: expert-guided augmentation for the robust detection of compounds interfering with biological assays","authors":"Vincenzo Palmacci, Yasmine Nahal, Matthias Welsch, Ola Engkvist, Samuel Kaski, Johannes Kirchmair","doi":"10.1186/s13321-025-01014-3","DOIUrl":"10.1186/s13321-025-01014-3","url":null,"abstract":"<p>Assay interference caused by small organic compounds continues to pose formidable challenges to early drug discovery. Various computational methods have been developed to identify compounds likely to cause assay interference. However, due to the scarcity of data available for model development, the predictive accuracy and applicability of these approaches are limited. In this work, we present E-GuARD, a novel framework seeking to address data scarcity and imbalance by integrating self-distillation, active learning, and expert-guided molecular generation. E-GuARD iteratively enriches the training data with interference-relevant molecules, resulting in quantitative structure-interference relationship (QSIR) models with superior performance. We demonstrate the utility of E-GuARD with the examples of four high-quality data sets on thiol reactivity, redox reactivity, nanoluciferase inhibition, and firefly luciferase inhibition. Our models reached MCC values of up to 0.47 for these data sets, with two-fold or higher improvements in enrichment factors compared to models trained without E-GuARD data augmentation. These results highlight the potential of E-GuARD as a scalable solution to mitigating assay interference in early drug discovery.</p><p>We present E-GuARD, an innovative framework that combines iterative self-distillation with guided molecular augmentation to enhance the predictive performance of QSAR models. By allowing models to learn from newly generated, informative compounds through iterations, E-GuARD facilitates the understanding of underrepresented structural patterns and improves performance on unseen data. 
When applied across different interference mechanisms, E-GuARD consistently outperformed standard approaches. E-GuARD establishes the foundation for further research into dynamic data enrichment and more robust molecular modeling.</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-04-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01014-3","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143884315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Moldina: a fast and accurate search algorithm for simultaneous docking of multiple ligands","authors":"Radek Halfar, Jiří Damborský, Sérgio M. Marques, Jan Martinovič","doi":"10.1186/s13321-025-01005-4","DOIUrl":"10.1186/s13321-025-01005-4","url":null,"abstract":"<div><p>Protein-ligand docking is a computational method routinely used in many structural biology applications. It usually involves one receptor and one ligand. The docking of multiple ligands, however, can be important in several situations, such as the study of synergistic effects, substrate and product inhibition, or competitive binding. This can be a challenging and computationally demanding process. By integrating Particle Swarm Optimization into the established AutoDock Vina framework, we provide a powerful tool capable of accelerating drug discovery and computational enzymology. Here we present Moldina (Multiple-Ligand Molecular Docking over AutoDock Vina), a new algorithm built upon AutoDock Vina. Through comprehensive testing against AutoDock Vina, the algorithm exhibited comparable accuracy in predicting ligand binding conformations while reducing the computational time by a factor of up to several hundred. 
Moldina and the benchmark data are freely available at https://opencode.it4i.eu/permed/moldina-multiple-ligand-molecular-docking-over-autodock-vina and https://github.com/It4innovations/moldina-multiple-ligand-molecular-docking-over-autodock-vina.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01005-4","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143880345","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Translating community-wide spectral library into actionable chemical knowledge: a proof of concept with monoterpene indole alkaloids","authors":"Sarah Szwarc, Adriano Rutz, Kyungha Lee, Yassine Mejri, Olivier Bonnet, Hazrina Hazni, Adrien Jagora, Rany B. Mbeng Obame, Jin Kyoung Noh, Elvis Otogo N’Nang, Stephenie C. Alaribe, Khalijah Awang, Guillaume Bernadat, Young Hae Choi, Vincent Courdavault, Michel Frederich, Thomas Gaslonde, Florian Huber, Toh-Seok Kam, Yun Yee Low, Erwan Poupon, Justin J. J. van der Hooft, Kyo Bin Kang, Pierre Le Pogam, Mehdi A. Beniddir","doi":"10.1186/s13321-025-01009-0","DOIUrl":"10.1186/s13321-025-01009-0","url":null,"abstract":"<div><p>With over 3000 representatives, the monoterpene indole alkaloids (MIAs) class is among the most diverse families of plant natural products. The MS/MS spectral space exploration of these complex compounds using chemoinformatic and computational mass spectrometry tools offers a valuable opportunity to extract and share chemical insights from this emblematic family of natural products (NPs). In this work, we first present a substantially updated version of the MIADB, a database now containing 422 MS/MS spectra of MIAs that has been uploaded to the GNPS library versus 172 initial entries. We then introduce an innovative workflow that leverages hundreds of fragmentation spectra to support the FAIRification, extraction and dissemination of chemical knowledge. This workflow aims at the extraction of spectral patterns matching finely defined MIA skeletons. These extracted signatures can then be queried against complex biological extract datasets using MassQL. By applying this strategy to an LC-MS/MS dataset of 75 plant extracts, our results demonstrated the efficiency of this approach in identifying the diversity of MIA skeletons present in the analyzed samples. 
Additionally, our work enabled the digitization of structural data for diverse MIA skeletons by converting them into machine-readable formats, thereby enhancing their dissemination to the scientific community.</p><p><b>Scientific contribution</b> A comprehensive investigation of the monoterpene indole alkaloid chemical space, aiming to highlight skeleton-dependent fragmentation similarity trends and to generate valuable spectrometric signatures that could be used as queries.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01009-0","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143883668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SMILES all around: structure to SMILES conversion for transition metal complexes","authors":"Maria H. Rasmussen, Magnus Strandgaard, Julius Seumer, Laura K. Hemmingsen, Angelo Frei, David Balcells, Jan H. Jensen","doi":"10.1186/s13321-025-01008-1","DOIUrl":"10.1186/s13321-025-01008-1","url":null,"abstract":"<div><p>We present a method for creating RDKit-parsable SMILES for transition metal complexes (TMCs) based on xyz-coordinates and overall charge of the complex. This can be viewed as an extension to the program xyz2mol that does the same for organic molecules. The only dependency is RDKit, which makes it widely applicable. Generating SMILES from structure for TMCs has lacked an existing SMILES dataset to compare against, so sanity-checking a method has required manual work. We therefore also generate SMILES in two other ways: one in which ligand charges and TMC connectivity are based on natural bond orbital (NBO) analysis from density functional theory (DFT) calculations, utilizing recent work by Kneiding et al. (Digit Discov 2: 618–633, 2023), and another that fixes SMILES available through the Cambridge Structural Database (CSD), making them parsable by RDKit. We compare these three different ways of obtaining SMILES for a subset of the CSD (tmQMg) and find >70% agreement for all three pairs. We utilize these SMILES to make simple molecular fingerprint (FP) and graph-based representations of the molecules to be used in the context of machine learning. Comparing with the graphs made by Kneiding et al., where nodes and edges are featurized with DFT properties, we find that, depending on the target property (polarizability, HOMO-LUMO gap or dipole moment), the SMILES-based representations can perform equally well. This makes them very suitable as baseline models. 
Finally, we present a dataset of 227k RDKit-parsable SMILES for mononuclear TMCs in the CSD.</p><p><b>Scientific contribution</b> We present a method that can create RDKit-parsable SMILES strings of transition metal complexes (TMCs) from Cartesian coordinates and use it to create a dataset of 227k TMC SMILES strings. The RDKit-parsability allows us to perform machine learning studies of TMC properties using “standard” molecular representations such as fingerprints and 2D-graph convolution. We show that these relatively simple representations can perform quite well depending on the target property.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":7.1,"publicationDate":"2025-04-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01008-1","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143883667","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}