Reverse engineering molecules from fingerprints through deterministic enumeration and generative models
Philippe Meyer, Thomas Duigou, Guillaume Gricourt, Jean-Loup Faulon
Journal of Cheminformatics, published 2025-10-15
DOI: https://doi.org/10.1186/s13321-025-01074-5
Abstract
Reverse engineering in molecular design aims to identify optimal structures based on activities or properties computed from molecular descriptors such as fingerprints. This task is known to be particularly difficult for the widely used Extended-Connectivity Fingerprints (ECFPs), because a significant amount of structural information is lost during vectorization. While recent artificial intelligence-based works have raised awareness of the privacy risks associated with ECFP-based data sharing, we contribute a more conclusive demonstration by introducing a deterministic algorithm that reconstructs molecular structures from ECFPs. Using MetaNetX and eMolecules as databases of natural compounds and commercially available chemicals, we use the deterministic algorithm to benchmark a Transformer-based generative model trained to predict SMILES from ECFPs. The generative model achieves a top-ranked retrieval accuracy of 95.64% but struggles with exhaustive enumeration. Additionally, applying the deterministic method to a drug dataset reveals its potential for de novo drug design, as many of the reverse-engineered structures are found to be patented or supported by bioassay data.
We present a deterministic algorithm that reconstructs molecular structures from ECFP vectors, demonstrating that these fingerprints are invertible. In parallel, we benchmark a Transformer-based generative model trained to predict SMILES from ECFPs, showing high accuracy but limitations in chemical space coverage. This dual approach advances reverse engineering in molecular design, offering new tools for de novo drug discovery.
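To make the information loss during vectorization concrete, the following is a minimal sketch of a toy circular fingerprint in the spirit of ECFP. It is not the RDKit implementation and not the authors' algorithm: the function names (`stable_hash`, `toy_circular_fingerprint`, `linear_alkane`), the atom invariant (element symbol plus heavy-atom degree), and the hashing scheme are all assumptions chosen for illustration, and the duplicate-substructure removal of real ECFP is omitted. The sketch shows that a binary fingerprint records which atom environments are present, not how many times they occur, so two distinct molecules can map to the same vector.

```python
import hashlib


def stable_hash(obj) -> int:
    """Deterministic 32-bit hash (the built-in hash() is salted per process)."""
    digest = hashlib.sha256(repr(obj).encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big")


def toy_circular_fingerprint(atoms, bonds, radius=2, n_bits=2048):
    """Toy ECFP-like fingerprint (illustrative assumption, not real ECFP).

    atoms: list of element symbols (hydrogens implicit).
    bonds: list of (i, j, bond_order) tuples.
    Returns the set of on-bit positions in a vector of length n_bits.
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j, order in bonds:
        neighbors[i].append((order, j))
        neighbors[j].append((order, i))

    # Radius 0: each atom identifier encodes its element and heavy-atom degree.
    ids = [stable_hash((sym, len(neighbors[i]))) for i, sym in enumerate(atoms)]
    features = set(ids)

    # Each iteration grows every atom environment by one bond.
    for _ in range(radius):
        new_ids = []
        for i in range(len(atoms)):
            # Sorting makes the identifier independent of atom numbering.
            env = sorted((order, ids[j]) for order, j in neighbors[i])
            new_ids.append(stable_hash((ids[i], tuple(env))))
        ids = new_ids
        features.update(ids)

    # Folding: many identifiers share one bit position, and counts are dropped.
    return {f % n_bits for f in features}


def linear_alkane(n):
    """Unbranched carbon chain with n atoms, as (atoms, bonds)."""
    return ["C"] * n, [(i, i + 1, 1) for i in range(n - 1)]


heptane = linear_alkane(7)
octane = linear_alkane(8)
same = toy_circular_fingerprint(*heptane) == toy_circular_fingerprint(*octane)
print(same)  # True: both chains expose the same set of radius-2 environments
```

In this toy scheme, heptane and octane are indistinguishable at radius 2 because the extra interior carbon of octane adds no new environment. Any procedure that inverts such a fingerprint therefore has to return a set of candidate structures rather than a single answer, which is why exhaustive enumeration matters alongside top-ranked retrieval.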
About the journal:
Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling.
Coverage includes, but is not limited to:
chemical information systems, software and databases, and molecular modelling,
chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases,
computer and molecular graphics, computer-aided molecular design, expert systems, QSAR, and data mining techniques.