Amelia Villegas-Morcillo, Gijs J. Admiraal, Marcel J. T. Reinders, Jana M. Weber
{"title":"All-atom protein sequence design using discrete diffusion models","authors":"Amelia Villegas-Morcillo, Gijs J. Admiraal, Marcel J. T. Reinders, Jana M. Weber","doi":"10.1186/s13321-025-01121-1","DOIUrl":"10.1186/s13321-025-01121-1","url":null,"abstract":"<div><p>Advancing protein design is crucial for breakthroughs in medicine and biotechnology. Traditional approaches for protein sequence representation often rely solely on the 20 canonical amino acids, limiting the representation of non-canonical amino acids and residues that undergo post-translational modifications. This work explores discrete diffusion models for generating novel protein sequences using the all-atom chemical representation SELFIES. By encoding the atomic composition of each amino acid in the protein, this approach expands the design possibilities beyond standard sequence representations. Using a modified ByteNet architecture within the discrete diffusion D3PM framework, we evaluate the impact of this all-atom representation on protein quality, diversity, and novelty, compared to conventional amino acid-based models. To this end, we develop a comprehensive assessment pipeline to determine whether generated SELFIES sequences translate into valid proteins containing both canonical and non-canonical amino acids. Additionally, we examine the influence of two noise schedules within the diffusion process—uniform (random replacement of tokens) and absorbing (progressive masking)—on generation performance. While models trained on the all-atom representation struggle to consistently generate fully valid proteins, the successfully generated proteins show improved novelty and diversity compared to their amino acid-based model counterparts. Furthermore, the all-atom representation achieves structural foldability results comparable to those of amino acid-based models. Lastly, our results highlight the absorbing noise schedule as the most effective for both representations. 
Data and code are available at https://github.com/Intelligent-molecular-systems/All-Atom-Protein-Sequence-Generation.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"18 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1186/s13321-025-01121-1.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145652885","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
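The two noise schedules named in the abstract above (uniform: random replacement of tokens; absorbing: progressive masking) can be illustrated with a toy forward-corruption step. This is a minimal sketch and not the authors' implementation: the function name `corrupt`, the `[MASK]` token, the toy vocabulary, and the single per-token corruption probability `p` are all assumptions made for illustration. D3PM itself defines the forward process through per-step transition matrices.

```python
import random

MASK = "[MASK]"  # hypothetical absorbing-state token (assumption)


def corrupt(tokens, vocab, p, schedule, rng):
    """Corrupt each token independently with probability p.

    schedule="uniform": a corrupted token is redrawn uniformly from vocab
    (it may land on the same token again).
    schedule="absorbing": a corrupted token is replaced by MASK.
    """
    out = []
    for tok in tokens:
        if rng.random() >= p:
            out.append(tok)  # token survives this noising step
        elif schedule == "absorbing":
            out.append(MASK)
        elif schedule == "uniform":
            out.append(rng.choice(vocab))
        else:
            raise ValueError(f"unknown schedule: {schedule}")
    return out


# Toy SELFIES-like vocabulary and sequence, for illustration only.
vocab = ["[C]", "[N]", "[O]", "[=O]", "[Branch1]"]
x0 = ["[N]", "[C]", "[C]", "[=O]", "[O]"]

rng = random.Random(0)
print(corrupt(x0, vocab, 0.5, "uniform", rng))
print(corrupt(x0, vocab, 0.5, "absorbing", rng))
```

With `p = 0` the sequence is returned unchanged, and with `p = 1` the absorbing schedule yields a fully masked sequence. Under the same simplification, the uniform schedule at `p = 1` yields a sequence of independent random draws from the vocabulary.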
Bruno Macedo, Inês Ribeiro Vaz, Tiago Taveira Gomes
{"title":"Novel molecule design with POWGAN, a policy-optimized Wasserstein generative adversarial network","authors":"Bruno Macedo, Inês Ribeiro Vaz, Tiago Taveira Gomes","doi":"10.1186/s13321-025-01114-0","DOIUrl":"10.1186/s13321-025-01114-0","url":null,"abstract":"<p>Generative artificial intelligence has the potential to open vast new chemical search spaces, yet existing reinforcement-guided generative adversarial networks (GANs) struggle to produce non-fragmented and property-oriented molecules at scale without compromising other properties. To overcome these limitations, we present Policy-Optimised Wasserstein GAN (POWGAN), a graph-based generator that incorporates a dynamically scaled reward into adversarial training. The scaling factor increases when progress stalls, keeping gradients informative while steadily steering the generator towards user-defined objectives. When POWGAN replaces the loss function in the earlier MedGAN architecture, with graph connectivity (non-fragmentation) as the target property, the model attains a connectivity of 1.00 (fully connected quinoline-like molecules), compared to the previous 0.62, while maintaining novelty (0.93) and uniqueness (0.95). The resulting model, R-MedGAN, produces > 12,000 novel quinoline-like molecules, a significant increase over its predecessor under identical experimental conditions. Chemical space visualizations demonstrate that these molecules populate regions not present in the training dataset or MedGAN, confirming genuine scaffold innovation. By providing an architecture capable of orienting the generative process towards a reward, our study also shows that this strategy can progress towards drug-likeness properties. The proportion of molecules with a Synthetic Accessibility Score (SAS, measured by the Ertl algorithm) between 1 and 6 increased from 8% to 65%, and the proportion with lipophilicity (measured as LogP) between 1.35 and 1.80 increased from 17% to 45%, compared to baseline. 
Our study shows that the R-MedGAN architecture, incorporating the POWGAN loss, also generalizes to models trained on molecular scaffolds other than the quinoline originally tested in MedGAN (R-MedGAN-QNL). For indole (R-MedGAN-IND) and imidazole (R-MedGAN-IMZ) datasets, connectivity increased from 0.38 and 0.50 up to 1.00 during training. This study provides evidence that an adaptive reward-scaling policy in a Wasserstein GAN can simultaneously guide the generative training towards a reward by enhancing molecular connectivity, expand generative throughput, preserve diversity, and improve drug-likeness properties. By eliminating the trade-off between property optimisation and sample diversity, POWGAN and its R-MedGAN implementation advance the state of the art in molecule-generating GANs and deploy a robust, scalable platform for high-throughput, goal-directed chemical exploration in early-stage drug discovery. These findings underscore the effectiveness of adaptive, reward-oriented reinforcement strategies in generative adversarial networks for molecular discovery.</p><p>In this work we introduce POWGAN, a policy-optimized Wasserstein GAN that uses adaptive reward scaling to improve goal-directed molecule generation. 
Integrated into MedGAN (R-MedGAN), it increases the number of valid, connected, and novel mol","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"18 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1186/s13321-025-01114-0.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145652913","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How to build machine learning models able to extrapolate from standard to modified peptides","authors":"Raúl Fernández-Díaz, Rodrigo Ochoa, Thanh Lam Hoang, Vanessa Lopez, Denis C. Shields","doi":"10.1186/s13321-025-01115-z","DOIUrl":"10.1186/s13321-025-01115-z","url":null,"abstract":"<div><p>Bioactive peptides are an important class of natural products with great functional versatility. Chemical modifications can improve their pharmacology, yet their structural diversity presents unique challenges for computational modeling. Furthermore, data for standard peptides (composed of the 20 canonical amino acids) is more abundant than for modified ones. Thus, we set out to identify whether predictive models fitted to standard data are reliable when applied to modified peptides. To do this, we first considered two critical aspects of the modeling problem, namely, choice of similarity function for guiding dataset partitioning and choice of molecular representation. Similarity-based dataset partitioning is an evaluation technique that divides the dataset into train and test subsets, such that the molecules in the test set are different from those used to fit the model.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12751563/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145626959","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Nipah Virus Inhibitor Knowledgebase (NVIK): a combined evidence approach to prioritise small molecule inhibitors","authors":"Bhupender Singh, Nishi Kumari, Ayush Upadhyay, Bhavini Pahuja, Eugenia Covernton, Kishan Kalia, Kanika Tuteja, Priyanka Rani Paul, Rakesh Kumar, Mayur Sudhakar Zarkar, Anshu Bhardwaj","doi":"10.1186/s13321-025-01049-6","DOIUrl":"10.1186/s13321-025-01049-6","url":null,"abstract":"<div><p>Nipah Virus (NiV) came into the limelight due to an outbreak in Kerala, India. NiV infection can cause severe respiratory and neurological problems with a fatality rate of 40–70%. It is a public health concern and has the potential to become a global pandemic. Lack of treatment has forced the containment methods to be restricted to isolation and surveillance. WHO’s ‘R&D Blueprint list of priority diseases’ (2018) indicates that there is an urgent need for accelerated research & development for addressing NiV. In the quest for druglike NiV inhibitors (NVIs), a thorough literature search followed by systematic data curation was conducted. The curated NVIs were then rigorously analysed in order to prioritise compounds. Our efforts led to the creation of the Nipah Virus Inhibitor Knowledgebase (NVIK), a well-curated structured knowledgebase of 220 NVIs with 142 unique small molecule inhibitors. The reported IC50/EC50 values for some of these inhibitors are in the nanomolar range, as low as 0.47 nM. Of 142 unique small-molecule inhibitors, 124 (87.32%) compounds cleared the PAINS filter. The clustering analysis identified more than 90% of the NVIs as singletons, signifying their diverse structural features. This diverse chemical space can be utilized in numerous ways to develop druglike anti-Nipah molecules. Further, we prioritised the top 10 NVIs, based on robustness of assays, physicochemical properties and their toxicity profiles. 
All NVI-related information, including structures, physicochemical properties, similarity analysis with FDA-approved drugs and other chemical libraries, and predicted ADMET profiles, is freely accessible at https://datascience.imtech.res.in/anshu/nipah/. The NVIK has the provision to submit new inhibitors as and when reported by the community for further enhancement of the NVIs landscape.</p><p>Scientific contribution</p><p>The NVIK is a dedicated resource for NiV drug discovery containing manually curated NVIs. The NVIs are structurally mapped with known chemical space to identify their structural diversity and recommend strategies for chemical library expansion. Also, in NVIK a combined evidence-based strategy is used to prioritise these inhibitors.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1186/s13321-025-01049-6.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145583550","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Beyond performance: how design choices shape chemical language models","authors":"Inken Fender, Jannik Adrian Gut, Thomas Lemmin","doi":"10.1186/s13321-025-01099-w","DOIUrl":"10.1186/s13321-025-01099-w","url":null,"abstract":"<div><p>Chemical language models (CLMs) have shown strong performance in molecular property prediction and generation tasks. However, the impact of design choices, such as molecular representation format, tokenization strategy, and model architecture, on both performance and chemical interpretability remains underexplored. In this study, we systematically evaluate how these factors influence CLM performance and chemical understanding. We evaluated models through fine-tuning on downstream tasks and probing the structure of their latent spaces using probing predictors, vector operations, and dimensionality reduction techniques. Although downstream task performance was similar across model configurations, substantial differences were observed in the structure and interpretability of internal representations, highlighting that design choices meaningfully shape how chemical information is encoded. In practice, atomwise tokenization generally improved interpretability, and a RoBERTa-based model with SMILES input remains a reliable starting point for standard prediction tasks, as no alternative consistently outperformed it. 
These results provide guidance for the development of more chemically grounded and interpretable CLMs.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01099-w","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145535347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NPBS Atlas: a comprehensive data resource for exploring the biological sources of natural products","authors":"Tingjun Xu, Jinfang Dai, Yingyong Li, Junhong Zhou, Yingli Zhao, Weiming Chen, Xiao-Song Xue","doi":"10.1186/s13321-025-01116-y","DOIUrl":"10.1186/s13321-025-01116-y","url":null,"abstract":"<div><p>Natural products continue to play a pioneering role in drug discovery due to their extraordinary chemical and biological diversity. However, their full therapeutic potential remains largely underutilized, hindered by the fragmented documentation of biological origins in existing data resources. Here, we present the natural product and biological source atlas (NPBS Atlas), a data resource that covers over 218,000 natural products fully annotated with comprehensive biological sources, bioactivities, and references. The database, established through systematic text mining and expert manual curation, places special emphasis on curating source organism data: scientific nomenclature, taxonomic classification, source parts, and the source of Traditional Chinese Medicines. NPBS Atlas represents a significant advancement in natural product data resources through its unique content, specialized annotations, and featured data, thereby enabling unprecedented exploration of nature-derived chemical diversity through biological context. 
The web interface of NPBS Atlas is freely available at https://biochemai.cstspace.cn/npbs/.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01116-y","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145535661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Predicting the critical micelle concentration of binary surfactant mixtures using machine learning","authors":"Aditya Choudhary, Saaketh Desai, Methun Kamruzzaman, Alexander Landera, Koushik Ghosh, Kunal Poorey","doi":"10.1186/s13321-025-01112-2","DOIUrl":"10.1186/s13321-025-01112-2","url":null,"abstract":"<div><p>Surfactant mixtures play a critical role in industries such as drug delivery, cosmetics, firefighting foams, and lubrication, serving as foundational components of the global economy. Their performance hinges on micelle formation, a self-assembly process governed by the critical micelle concentration (CMC), which enables key functions like solubilization, emulsification, and targeted molecular delivery. However, rapidly and accurately predicting the CMC of mixtures remains a significant challenge due to the chemical diversity and nonlinear interactions between surfactants. Here, we introduce an artificial neural network (ANN)-based machine learning framework to predict the CMC of binary surfactant mixtures. Our workflow leverages cheminformatics-derived molecular descriptors for each surfactant component, which are then aggregated using strategies such as concatenation, arithmetic mean, and harmonic mean. We find that pairing the arithmetic mean strategy with ANN yields the best performance, effectively capturing complex molecular interactions and enabling dual predictive capabilities: (1) precise interpolation of CMC values at untested mole fractions within known mixtures, and (2) accurate prediction of complete CMC–composition profiles for entirely novel surfactant combinations. SHAP-based interpretability analysis highlights that features such as hydrophobic surface area, electronic topological descriptors, and headgroup basicity drive model predictions, aligning with core principles of surfactant chemistry and reinforcing the mechanistic validity of our model. 
Overall, this framework accelerates data-driven surfactant design by reducing experimental burden and enabling rapid, rational optimization of formulations across pharmaceuticals, personal care, environmental remediation, and enhanced oil recovery.</p><p><b>Scientific contribution</b></p><p>This study presents a novel machine learning framework that, for the first time, predicts full critical micelle concentration (CMC)–composition profiles for binary surfactant mixtures, including untrained systems. By strategically combining the features of individual components of mixtures using arithmetic mean, our artificial neural network model deciphers nonlinear interactions between chemically distinct surfactants, enabling accurate and generalizable CMC predictions. Beyond performance gains, this framework facilitates rapid and systematic exploration of formulation space via inverse design and high-throughput screening, establishing a powerful foundation for the rational development of next-generation surfactants with applications in energy, environmental remediation, pharmaceuticals, and biomedical science.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01112-2","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145492529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
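The three aggregation strategies named in the abstract above (concatenation, arithmetic mean, harmonic mean) can be sketched for a pair of per-surfactant descriptor vectors. This is only an illustration under stated assumptions: the function names and the two-element toy descriptors are hypothetical, the means are unweighted (the abstract does not say whether mole fractions enter the aggregation), and the harmonic mean is taken element-wise and assumes strictly positive descriptor values.

```python
def concatenate(a, b):
    """Stack both descriptor vectors; the result depends on component order."""
    return list(a) + list(b)


def arithmetic_mean(a, b):
    """Element-wise (a + b) / 2; symmetric in the two components."""
    return [(x + y) / 2 for x, y in zip(a, b)]


def harmonic_mean(a, b):
    """Element-wise 2ab / (a + b); assumes strictly positive values."""
    return [2 * x * y / (x + y) for x, y in zip(a, b)]


# Toy descriptor vectors for two surfactants (made-up values).
surfactant_a = [1.0, 4.0]
surfactant_b = [3.0, 4.0]

print(concatenate(surfactant_a, surfactant_b))      # [1.0, 4.0, 3.0, 4.0]
print(arithmetic_mean(surfactant_a, surfactant_b))  # [2.0, 4.0]
print(harmonic_mean(surfactant_a, surfactant_b))    # [1.5, 4.0]
```

One difference worth noting: both means give the same feature vector whichever surfactant is listed first, whereas concatenation doubles the feature dimension and changes when the components are swapped.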
{"title":"How evaluation choices distort the outcome of generative drug discovery","authors":"Rıza Özçelik, Francesca Grisoni","doi":"10.1186/s13321-025-01108-y","DOIUrl":"10.1186/s13321-025-01108-y","url":null,"abstract":"<p>“How to evaluate the de novo designs proposed by a generative model?” Despite the transformative potential of generative deep learning in drug discovery, this seemingly simple question has no clear answer. The absence of standardized guidelines challenges both the benchmarking of generative approaches and the selection of molecules for prospective studies. In this work, we take a fresh – <i>critical</i> and <i>constructive</i> – perspective on de novo design evaluation. By training chemical language models, we analyze approximately 1 billion molecule designs and discover principles consistent across different neural networks and datasets. We uncover a key confounder: the size of the generated molecular library significantly impacts evaluation outcomes, often leading to misleading model comparisons. We identify increasing the number of designs as a remedy and propose new, compute-efficient metrics suited to large-scale evaluation. We also identify critical pitfalls in commonly used metrics (such as uniqueness and distributional similarity) that can distort assessments of generative performance. To address these issues, we propose new and refined strategies for reliable model comparison and design evaluation. Furthermore, when examining molecule selection and sampling strategies, our findings reveal constraints on diversifying the generated libraries and draw new parallels and distinctions between deep learning and drug discovery. We anticipate that our findings will help reshape evaluation pipelines in generative drug discovery, paving the way for more reliable and reproducible generative modeling approaches.</p><p>Our work takes a step toward enhancing the robustness and reliability of evaluation practices in generative drug discovery. 
We systematically analyze current evaluation practices using approximately one billion designs from deep learning models. We find that the number of designs, often an overlooked parameter, can distort scientific outcomes related to distributional similarity and diversity. Moreover, we show that using larger design libraries than are typically adopted helps to avoid this pitfall, and we develop efficient algorithms to enable large-scale studies. We also propose guidelines for prospective molecule selection and uncover inherent constraints in diversifying molecular designs.</p>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01108-y","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145492530","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Enhancing multi-task in vivo toxicity prediction via integrated knowledge transfer of chemical knowledge and in vitro toxicity information","authors":"Minsu Park, Yewon Shin, Hyunho Kim, Hojung Nam","doi":"10.1186/s13321-025-01110-4","DOIUrl":"10.1186/s13321-025-01110-4","url":null,"abstract":"<div><p>The evaluation of potential drug toxicity is a crucial step in early drug development. In vivo toxicity assessment represents a key challenge that must be addressed before advancing to clinical trials. However, traditional in vivo experiments primarily rely on animal models, raising concerns regarding cost, time efficiency, and ethical considerations. To address these challenges, various computational approaches have been developed to support in vivo toxicity evaluations, though these methods often demonstrate limited generalizability due to data scarcity. In this study, we propose MT-Tox, a knowledge transfer-based multi-task learning model specifically designed for in vivo toxicity prediction that overcomes data scarcity. Our model implements a sequential knowledge transfer strategy across three stages: general chemical knowledge pretraining, in vitro toxicological auxiliary training, and in vivo toxicity fine-tuning. This hierarchical approach significantly improves model performance by systematically leveraging information from both chemical structure and toxicity data sources. MT-Tox outperforms baseline models across three in vivo toxicity endpoints: carcinogenicity, drug-induced liver injury (DILI), and genotoxicity. Through ablation studies and attention analyses, we demonstrate that each knowledge transfer technique makes meaningful contributions to the prediction process. 
Finally, we demonstrate the real-world application of our model as a prediction tool for early-stage drug discovery through comprehensive DrugBank database screening.</p><p><b>Scientific contribution:</b> We propose a knowledge transfer framework that integrates chemical and in vitro toxicological information to enhance in vivo toxicity prediction in low-data regimes. Our model provides dual-level interpretability across chemical and biological domains through an attention mechanism. Moreover, we demonstrate our model’s applicability by screening the DrugBank database, simulating practical toxicity screening scenarios in drug development.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01110-4","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145492504","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Roy Aerts, Joris Tavernier, Alan Kerstjens, Mazen Ahmad, Jose Carlos Gómez-Tamayo, Gary Tresadern, Hans De Winter
{"title":"C2PO: an ML-powered optimizer of the membrane permeability of cyclic peptides through chemical modification","authors":"Roy Aerts, Joris Tavernier, Alan Kerstjens, Mazen Ahmad, Jose Carlos Gómez-Tamayo, Gary Tresadern, Hans De Winter","doi":"10.1186/s13321-025-01109-x","DOIUrl":"10.1186/s13321-025-01109-x","url":null,"abstract":"<div><p>Peptide drug development is currently receiving due attention as a modality between small and large molecules. Therapeutic peptides represent an opportunity to achieve high potency, selectivity, and reach intracellular targets. A new era in the development of therapeutic peptides emerged with the arrival of cyclic peptides, which avoid the limitations of parenteral administration by achieving sufficient oral bioavailability. However, improving the membrane permeability of cyclic peptides remains one of the principal bottlenecks. Here, we introduce a deep learning regression model of cyclic peptide membrane permeability based on publicly available data. The model starts with a chemical structure and goes beyond limited-vocabulary language models to generalize to monomers beyond the ones in the training dataset. Moreover, we introduce an efficient <i>estimator2generative</i> wrapper to enable using the model in direct molecular optimization of membrane permeability via chemical modification. We name our application <i>C2PO</i> (Cyclic Peptide Permeability Optimizer). Lastly, we demonstrate how a molecule correction tool can be used to limit the presence of unfamiliar chemistry in the generated molecules.</p><p><b>Scientific contribution</b>: We provide an ML-driven optimizer application, named C2PO, that returns structurally modified cyclic peptides with an improved membrane permeability, one of the pivotal tasks in drug discovery and development. C2PO is a first-in-class application for cyclic peptide permeability amelioration, in that it converts an ML model into a generative optimizer of chemical structures. 
Additionally, through demonstration we incentivize the usage of an automated post-correction tool with a chemistry reference library to correct strange chemistry outputs from C2PO, a known issue for ML-generated chemical structures.</p></div>","PeriodicalId":617,"journal":{"name":"Journal of Cheminformatics","volume":"17 1","pages":""},"PeriodicalIF":5.7,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://jcheminf.biomedcentral.com/counter/pdf/10.1186/s13321-025-01109-x","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145491700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}