Reverse engineering molecules from fingerprints through deterministic enumeration and generative models

Impact factor 5.7 · Zone 2 (Chemistry) · JCR Q1 (Chemistry, Multidisciplinary)
Philippe Meyer, Thomas Duigou, Guillaume Gricourt, Jean-Loup Faulon
Journal of Cheminformatics. Published 2025-10-15. DOI: 10.1186/s13321-025-01074-5 (https://doi.org/10.1186/s13321-025-01074-5)

Abstract

Reverse engineering in molecular design aims to identify optimal structures from activities or properties computed through molecular descriptors such as fingerprints. The task is known to be particularly difficult for the widely used Extended-Connectivity Fingerprints (ECFPs), because vectorization discards a significant amount of structural information. Recent artificial intelligence-based studies have raised awareness of the privacy risks of ECFP-based data sharing; we contribute a more conclusive demonstration by introducing a deterministic algorithm that reconstructs molecular structures from ECFPs. Using MetaNetX and eMolecules as databases of natural compounds and commercially available chemicals, respectively, we use the deterministic algorithm to benchmark a Transformer-based generative model trained to predict SMILES from ECFPs. The generative model achieves a top-ranked retrieval accuracy of 95.64% but struggles with exhaustive enumeration. Applying the deterministic method to a drug dataset also reveals its potential for de novo drug design, as many of the reverse-engineered structures are found to be patented or supported by bioassay data.

In summary, we present a deterministic algorithm that reconstructs molecular structures from ECFP vectors, demonstrating that these fingerprints are invertible. In parallel, we benchmark a Transformer-based generative model trained to predict SMILES from ECFPs, showing high accuracy but limited coverage of chemical space. This dual approach advances reverse engineering in molecular design and offers new tools for de novo drug discovery.
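To make the idea of information loss during vectorization concrete, the following is a minimal toy sketch of a circular (ECFP-like) fingerprint in pure Python. It is not the authors' algorithm and not the ECFP implementation of any cheminformatics toolkit: the function names, the CRC32 hashing, the atom invariants (element and degree only) and the 64-bit default length are all assumptions chosen for illustration. It only shows the general pattern of hashing growing atom environments and then folding the identifiers into a fixed-length bit vector.

```python
import zlib


def _h(text):
    """Deterministic 32-bit hash (Python's built-in hash() of str is salted per run)."""
    return zlib.crc32(text.encode("utf-8"))


def toy_circular_features(atoms, bonds, radius=2):
    """Return the set of integer identifiers for atom environments up to `radius`.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j) atom-index pairs (bond orders are ignored in this toy)
    """
    neighbors = {i: [] for i in range(len(atoms))}
    for i, j in bonds:
        neighbors[i].append(j)
        neighbors[j].append(i)

    # Radius 0: each atom is described by its element and its degree.
    ids = [_h(f"{sym}:{len(neighbors[i])}") for i, sym in enumerate(atoms)]
    features = set(ids)

    # Each iteration widens every atom's environment by one bond.
    # Sorting neighbor identifiers makes the result independent of atom numbering.
    for _ in range(radius):
        ids = [
            _h(repr((ids[i], sorted(ids[j] for j in neighbors[i]))))
            for i in range(len(atoms))
        ]
        features.update(ids)
    return features


def fold(features, n_bits=64):
    """Fold identifiers into a fixed-length bit vector; collisions lose information."""
    bits = [0] * n_bits
    for f in features:
        bits[f % n_bits] = 1
    return bits


# Hypothetical hand-written heavy-atom graphs (hydrogens omitted).
ethanol = (["C", "C", "O"], [(0, 1), (1, 2)])
ethanol_renumbered = (["O", "C", "C"], [(0, 1), (1, 2)])
dimethyl_ether = (["C", "O", "C"], [(0, 1), (1, 2)])

f_ethanol = toy_circular_features(*ethanol)
f_renumbered = toy_circular_features(*ethanol_renumbered)
f_ether = toy_circular_features(*dimethyl_ether)
```

In this sketch the two numberings of ethanol yield the same feature set, while dimethyl ether yields a different one. Folding is where structure is lost: the bit vector records only which positions are set, so it can never hold more set bits than there are features, and with `n_bits=1` every non-empty molecule collapses to the same vector. Inverting a fingerprint means recovering a graph that reproduces a given vector despite this loss.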
Source journal
Journal of Cheminformatics (Chemistry, Multidisciplinary; Computer Science, Information Systems)
CiteScore: 14.10
Self-citation rate: 7.00%
Articles published: 82
Review time: 3 months
About the journal: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling; chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases; computer and molecular graphics; computer-aided molecular design; expert systems; QSAR; and data mining techniques.