A transformer based generative chemical language AI model for structural elucidation of organic compounds

IF 7.1 | Zone 2, Chemistry | JCR Q1, CHEMISTRY, MULTIDISCIPLINARY
Xiaofeng Tan
Journal of Cheminformatics, published 2025-07-12
DOI: 10.1186/s13321-025-01016-1 (https://doi.org/10.1186/s13321-025-01016-1)
Citations: 0

Abstract

For over half a century, computer-aided structural elucidation (CASE) systems for organic compounds have relied on complex expert systems with explicitly programmed algorithms. These systems are often computationally inefficient for complex compounds because of the vast chemical structural space that must be explored and filtered. In this study, we present a proof-of-concept transformer-based generative chemical language artificial intelligence (AI) model, an end-to-end architecture designed to replace the logic and workflow of the classic CASE framework for ultra-fast and accurate spectroscopy-based structural elucidation. Our model employs an encoder-decoder architecture and self-attention mechanisms, similar to those in large language models, to directly generate the most probable chemical structures that match the input spectroscopic data. Trained on ~102k IR, UV, and ¹H NMR spectra, it performs structural elucidation of molecules with up to 29 atoms in just a few seconds on a modern CPU, achieving a top-15 accuracy of 83%. This approach demonstrates the potential of transformer-based generative AI to accelerate traditional scientific problem-solving processes. The model's ability to iterate quickly based on new data highlights its potential for rapid advancements in structural elucidation.

This study introduces a transformer-based generative AI model as a novel approach to structural elucidation for organic compounds, replacing traditional CASE systems with an end-to-end encoder-decoder architecture. This work demonstrates the potential of transformer models to revolutionize CASE by significantly accelerating the elucidation process and enabling rapid iterations with new data.
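The top-15 accuracy quoted in the abstract can be read as the fraction of test cases in which the true structure appears among the 15 highest-ranked generated candidates. The sketch below illustrates that metric only; it is not the author's evaluation code. The function name `top_k_accuracy`, the toy data, and the use of SMILES-like strings compared by exact string equality are all assumptions made for illustration, since the abstract does not state how structures are represented or matched.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_candidates: Sequence[Sequence[str]],
    true_structures: Sequence[str],
    k: int = 15,
) -> float:
    """Fraction of test cases whose true structure is among the first k candidates.

    Assumes every structure is written in one canonical string form, so that
    plain string equality is a valid match test (an assumption of this sketch).
    """
    if len(ranked_candidates) != len(true_structures):
        raise ValueError("need one candidate list per true structure")
    if not true_structures:
        raise ValueError("need at least one test case")
    hits = sum(
        truth in candidates[:k]
        for candidates, truth in zip(ranked_candidates, true_structures)
    )
    return hits / len(true_structures)


# Hypothetical toy data: candidate lists are ordered most probable first.
predictions = [
    ["CCO", "COC"],        # true structure ranked first
    ["CC(=O)O", "OCC=O"],  # true structure ranked second
    ["CCN", "CNC"],        # true structure not generated
    ["c1ccccc1"],          # true structure ranked first
]
truths = ["CCO", "OCC=O", "CCC", "c1ccccc1"]

print(top_k_accuracy(predictions, truths, k=1))   # 0.5
print(top_k_accuracy(predictions, truths, k=15))  # 0.75
```

In practice, candidates and reference structures would need to be canonicalized with a cheminformatics toolkit before comparison, because the same molecule can be written as many different strings.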
Source journal
Journal of Cheminformatics
Categories: CHEMISTRY, MULTIDISCIPLINARY; COMPUTER SCIENCE, INFORMATION SYSTEMS
CiteScore: 14.10
Self-citation rate: 7.00%
Articles published: 82
Review time: 3 months
Journal description: Journal of Cheminformatics is an open access journal publishing original peer-reviewed research in all aspects of cheminformatics and molecular modelling. Coverage includes, but is not limited to: chemical information systems, software and databases, and molecular modelling; chemical structure representations and their use in structure, substructure, and similarity searching of chemical substance and chemical reaction databases; computer and molecular graphics; computer-aided molecular design; expert systems; QSAR; and data mining techniques.