{"title":"Transformer Learning in Sequence-Based Drug Design Depends on Compound Memorization and Similarity of Sequence-Compound Pairs.","authors":"Jürgen Bajorath","doi":"10.1002/minf.70016","DOIUrl":"10.1002/minf.70016","url":null,"abstract":"<p><p>Chemical language models (CLMs), particularly encoder-decoder transformers, have advanced generative molecular design. Transformer CLMs are able to learn a variety of molecular mappings for compound design that can be conditioned using context-dependent rules. However, their black-box nature complicates the interpretation of predictions. Current analysis methods mostly focus on attention weights of token relationships or attention flow in encoder and decoder modules and cannot explain predictions at the molecular level. Sequence-based compound design was used as a model system to investigate transformer learning characteristics through systematic control calculations involving modifications of protein sequences and sequence-compound pairs. The analysis revealed that compound reproducibility depended on similarity relationships between training and test data and on compound memorization, while specific sequence information was not learned. These findings indicate that predictions of transformer CLMs are driven by memorization effects and statistical correlations rather than by learning specific chemical or biological information. 
Understanding this learning behavior aids in avoiding over-interpretation of model outputs and informs the appropriate application of transformer-based CLMs in molecular design.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"45 1","pages":"e70016"},"PeriodicalIF":3.1,"publicationDate":"2026-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12782052/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145934112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structure-Activity Relationships and Design of Focused Libraries Tailored for Staphylococcus Aureus Inhibition.","authors":"Alberto Marbán-González, José L Medina-Franco","doi":"10.1002/minf.70015","DOIUrl":"10.1002/minf.70015","url":null,"abstract":"<p><p>Staphylococcus aureus is a bacterium classified among the ESKAPE pathogens, which are anticipated to pose a significant global health emergency in the coming decades. The FabI enzyme, present in both Gram-positive and Gram-negative bacteria, is a key enzyme involved in fatty acid synthesis II (FAS-II). In this study, we utilized transformation rules to expand the chemical space from the most potent S. aureus FabI inhibitors. Three newly generated focused libraries, named INDDS, DIADS, and PYRDS, encompassed 172,026 compounds. These compounds were ranked based on structural similarity and predicted pIC<sub>50</sub> values obtained from machine learning models. This approach allowed us to prioritize compounds in each focused library targeting S. aureus FabI. We analyzed the pharmacological properties and chemical space diversity of the S. aureus FabI inhibitors to gather relevant insights and support the prioritization of compounds for further study. The three newly generated libraries are freely available at https://github.com/DIFACQUIM/S.aureus_inhibitors.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 11-12","pages":"e70015"},"PeriodicalIF":3.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12694758/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145724727","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Update and ADMET Profile of the Latin American Natural Product Database: LANaPDB.","authors":"Alejandro Gómez-García, Martin J Lavecchia, Dionisio A Olmedo, Pablo N Solís, José L Medina-Franco","doi":"10.1002/minf.70013","DOIUrl":"10.1002/minf.70013","url":null,"abstract":"<p><p>For more than 5 years, several countries in Latin America have been developing and updating compound databases of natural products (NPs) isolated and characterized in their countries. In parallel, multiple research groups have been collaborating and assembling a unified Latin American Natural Product Database (LANaPDB), an open-access compound collection representative of Latin America, a geographical region that stands out for the vastness and richness of its NP resources. Herein, we report a significant update of LANaPDB, which gathers NPs from eight countries. Major updates to the database include adding 1,164 new compounds obtained from NaturAr, a NP collection from Argentina published in 2025, and 132 new compounds from Panama. The updated LANaPDB has 14,742 nonduplicate compounds. Moreover, a comprehensive evaluation of 41 ADMET (absorption, distribution, metabolism, excretion, and toxicity)-related parameters was carried out for LANaPDB, and the results were compared with one of the largest NP databases, the Universal Natural Product Database, and the approved small-molecule drugs. The results indicated that the three databases have a very similar ADMET profile. In addition, most of the LANaPDB compounds presented high bioavailability, volume of distribution, plasma protein binding rate, blood-brain barrier penetration, susceptibility to CYP3A4, and half-life less than 12 h. Moreover, most of the LANaPDB compounds were predicted to have a low probability of inducing toxicity-related reactions. The third version of LANaPDB and the codes for the curation and determination of 41 ADMET-related parameters are freely available at https://doi.org/10.5281/zenodo.15595030. 
The code is general and can be used to analyze other compound libraries.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 11-12","pages":"e70013"},"PeriodicalIF":3.1,"publicationDate":"2025-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12694011/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145715146","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploration of (Ultra)Big Chemical Spaces.","authors":"José L Medina-Franco","doi":"10.1002/minf.70012","DOIUrl":"https://doi.org/10.1002/minf.70012","url":null,"abstract":"","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 10","pages":"e70012"},"PeriodicalIF":3.1,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145588154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Ligand B-Factor Index: A Metric for Prioritizing Protein-Ligand Complexes in Docking.","authors":"Liliana Halip, Cristian Neanu, Sorin Avram","doi":"10.1002/minf.70010","DOIUrl":"10.1002/minf.70010","url":null,"abstract":"<p><p>Docking is a structure-based cheminformatics tool broadly employed in early drug discovery. Based on the three-dimensional structure of the protein target, docking is used to predict the binding interactions between the protein and a ligand, estimate the corresponding binding affinity, or perform virtual screenings (VSs) to identify new active compounds. This study introduces the ligand B-factor index (LBI), a novel computational metric for prioritizing protein-ligand complexes for docking. Unlike other metrics, LBI directly compares atomic displacements in the ligand and binding site. LBI is defined as the ratio of the median atomic B-factor of the binding site to that of the bound ligand. Using the comparative assessment of scoring functions (CASF-2016) dataset, we evaluated the effectiveness of LBI in guiding the selection of protein-ligand complexes to enhance docking performance. Our results show a moderate correlation (Spearman ρ ~ 0.48) between LBI and the experimental binding affinities, outperforming several docking scoring functions. Additionally, LBI correlates with improved redocking success (root mean square deviation < 2 Å), underscoring the significance of a ligand-focused metric. While LBI outperforms other metrics such as the protein B-factor index and resolution, its utility in VS docking remains to be further investigated. 
LBI is easy to compute, interpretable, applicable in structure-based cheminformatics, and freely available for calculation at https://chembioinf.ro/tool-bi-computing.html.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 9","pages":"e202500127"},"PeriodicalIF":3.1,"publicationDate":"2025-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12423484/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145033654","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine Learning-Based Identification of Petroleum Distillates and Gasoline Traces Using Measured and Synthetic GC Spectra from Collected Samples.","authors":"Omer Kaspi, Yaniv Y Avissar, Arnon Grafit, Ron Chibel, Olga Girshevitz, Hanoch Senderowitz","doi":"10.1002/minf.70008","DOIUrl":"https://doi.org/10.1002/minf.70008","url":null,"abstract":"<p><p>Ignition cases involving arsons are typically handled by forensic experts who examine spectra of samples collected from scenes of fire to test for the existence or absence of ignitable liquids. This is tedious work, since many cases do not involve such liquids. To facilitate this process, we have developed in this work a Machine Learning (ML)-based workflow for samples' classification based on their gas chromatography (GC) chromatograms (i.e., spectra). To this end, annotated spectra of 181 samples containing three groups of liquids (petroleum distillates, gasoline, and an assortment of other substances) collected from fire scenes as well as two reference databases were obtained from the Israeli Department of Identification and Forensic Sciences (DIFS). These spectra were used for the derivation of ML-based classification models using three algorithms, namely, kNN, representative spectrum, and random forest (RF) giving rise to reliable predictions. To increase the size of the dataset to a level that would enable the usage of more advanced ML algorithms, we have used the experimental spectra to develop a new spectra synthesis algorithm and utilized it to generate a large dataset of synthetic spectra. This dataset was used for the derivation of new kNN, RF, and representative spectrum models as well as deep learning (DL) models producing F1-scores over an independent test set composed entirely of \"real\" spectra ranging from 0.74-0.95, 0.86-0.95, 0.30-0.75, and 0.85-0.96 for kNN, RF, representative spectrum, and DL, respectively. 
Following the completion of the work, a second set of real spectra was provided to us by DIFS, and modeling it with the second set of models yielded F1-scores ranging from 0.92-0.96, 0.96-1.00, 0.71-0.82, and 0.95-0.98 for kNN, RF, representative spectrum, and DL, respectively. These results therefore suggest that for this dataset, performances depend more on the size of the dataset used for model training than on the ML algorithm. We propose that the workflow and spectra synthesis algorithm developed in this work could be readily applied to other forensic domains where samples are characterized by spectra, either solely or in combination with other parameters.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 8","pages":"e202400371"},"PeriodicalIF":3.1,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12371388/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144961933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Integrating Generative Pretrained Transformer and Genetic Algorithms for Efficient and Diverse Molecular Generation.","authors":"Chengcheng Xu, Chen Zeng, Xi Yang, Yingxu Liu, Xiangzhen Ning, Lidan Zheng, Yang Liu, Qing Fan, Chao Xu, Haichun Liu, Xian Wei, Yadong Chen, Yanmin Zhang, Rui Gu","doi":"10.1002/minf.70005","DOIUrl":"https://doi.org/10.1002/minf.70005","url":null,"abstract":"<p><p>In computer-aided drug design, molecular generation models play a crucial role in accelerating the drug development process. Current models mainly fall into two categories: deep learning models with high performance but poor interpretability and heuristic algorithms with better interpretability but limited performance. In this study, we introduce an innovative molecular generation model, the compound construction model (CCMol), which integrates the powerful generative capabilities of the generative pretrained transformer (GPT) and the efficient optimization mechanisms of genetic algorithms (GA) to achieve effective and innovative molecular structures. Specifically, our approach uses structure-based drug design comprising both ligand and protein primary structure-based aspects. CCMol integrates GPT for initial molecular generation and GA for iterative optimization of physicochemical and biological properties. The model's reliability was validated by generating molecules targeting three critical disease-related proteins (GLP1, WRN, and JAK2). 
The results indicate that CCMol is on par with current advanced models on multiple indicators and performs better than the baseline model in terms of structural diversity and drug-related property indicators, demonstrating that CCMol exhibits outstanding performance in developing novel and effective candidate drug molecules and is particularly suitable for expanding the chemical validity of candidate structures at the early stages of drug discovery.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 8","pages":"e202500094"},"PeriodicalIF":3.1,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144784859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"LiProS: Findable, Accessible, Interoperable, and Reusable Data Simulation Workflow to Predict Accurate Lipophilicity Profiles for Small Molecules.","authors":"Esteban Bertsch-Aguilar, Antonio Piedra, Daniel Acuña, Sebastián Suñer, Sylvana Pinheiro, William J Zamora","doi":"10.1002/minf.70007","DOIUrl":"https://doi.org/10.1002/minf.70007","url":null,"abstract":"<p><p>Lipophilicity is a fundamental physicochemical property widely used to evaluate key parameters in drug design, materials science, and food engineering. It plays a critical role in predicting membrane permeability, absorption, and distribution of compounds. Moreover, lipophilicity is commonly integrated into scoring functions to model biomolecular interactions and serves as an important molecular descriptor in machine learning models for property prediction and compound classification. The selection of the appropriate pH-dependent lipophilicity ($\\log D_{\\mathrm{pH}}$) model is important to ensure its accuracy. The incorporation of the ion apparent partition coefficient ($P_{\\mathrm{I}}^{\\mathrm{app}}$) into predictions of pH-dependent lipophilicity profiles can be essential for accurately reproducing experimental results. In accordance with the principles for findable, accessible, interoperable, and reusable data to improve data management and sharing, here, we introduce LiProS, a FAIR workflow that is easily accessible through a Google Colab notebook. LiProS assists researchers in efficiently determining the appropriate pH-dependent lipophilicity profile based on the SMILES code of their molecules of interest. 
In addition, LiProS demonstrated its utility in the analysis of ionizable compounds within the NAPRORE-CR natural products database, enabling the identification of the most appropriate lipophilicity formalism tailored to the physicochemical characteristics of these compounds.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 8","pages":"e202500136"},"PeriodicalIF":3.1,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144962005","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Structural Flexibility and Shape Similarity Contribute to Exclusive Functions of Certain Atg8 Isoforms in the Autophagy Process.","authors":"Alexey Rayevsky, Eliah Bulgakov, Mariia Stykhylias, Sergey Ozheredov, Svetlana Spivak, Yaroslav Blume","doi":"10.1002/minf.70004","DOIUrl":"https://doi.org/10.1002/minf.70004","url":null,"abstract":"<p><p>Despite the abundance of systematically collected experimental data and facts, the multistep process of autophagy still contains many dark spots. One concerns the background selectivity of interactions between certain autophagy-related protein (ATG8) isoforms and their receptors/adaptors in plants during the autophagy process. By regulating phagophore initiation, expansion, and maturation, these proteins control the assembly of numerous autophagy proteins at this key docking platform. Bioinformatics analysis of human, yeast, and plant ATG8 amino acid sequences allows us to build a sequence tree of plant ATG8s, divided into three groups. We perform a structural study aimed at revealing some of the underlying reasons for the differences in the selectivity of ATG8 isoforms. A series of molecular dynamics (MD) simulations is performed to explain the stage-dependent functionality of ATG8. The conserved secondary structure and folding across all ATG8 proteins, resulting in nearly identical protein-protein interaction interfaces, make this study particularly important and interesting. Recognizing the dual role of the LC3 interacting region (LIR) in autophagosome biogenesis and recruitment of the anchored selective autophagy receptor (SAR), we perform a mobility domain analysis. To this end, the amino acid sequence associated with the LIR docking site (LDS) interface is localized and subjected to root mean square deviation (RMSD)-based clustering analysis. 
Starting from Atg8-targeted protein-peptide docking, we attempt to identify conformational changes in the contact region of the corresponding adaptors and receptors involved in the common biogenesis events in autophagy. For the molecular dynamics, we select three representatives, sharing common patterns with other members of the groups. The resulting ATG8-peptide complexes display a significant preference for binding specific partners by different ATG8 isotypes.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 7","pages":"e202500025"},"PeriodicalIF":2.8,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144659700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Network Analysis of the Organic Chemistry in Patents, Literature, and Pharmaceutical Industry.","authors":"Emma Svensson, Emma Granqvist, Tomas Bastys, Christos Kannas, Mikhail Kabeshov, Samuel Genheden, Ola Engkvist, Thierry Kogej","doi":"10.1002/minf.202500011","DOIUrl":"10.1002/minf.202500011","url":null,"abstract":"<p><p>Chemical reactions can be connected in large networks such as knowledge graphs. In this way, prior work has been able to draw meaningful conclusions about the properties and structures involved in organic chemistry reactions. However, the research has focused on public sources of organic synthesis that might lack the intricate details of the synthetic routes used in in-house drug discovery. In this work, previous analyses are expanded to also include an in-house electronic lab notebook (ELN) source, such that we can compare it to knowledge graphs that were constructed from US Patent and Trademark Office (USPTO) and Reaxys. We found that the Reaxys knowledge graph is the most interconnected and has the largest proportion of nodes belonging to the core, whereas the USPTO is much less connected and only has a small core. The ELN knowledge graph falls between these extremes in connectivity and it does not have any core. The hub molecules of ELN and USPTO are most similar, primarily represented by small, organic building blocks. We hypothesize that these differences can be attributed to the different origins of the data in the three sources. 
We discuss what impact this might have on synthesis prediction modelling.</p>","PeriodicalId":18853,"journal":{"name":"Molecular Informatics","volume":"44 7","pages":"e202500011"},"PeriodicalIF":2.8,"publicationDate":"2025-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12273192/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144659699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}