{"title":"Token-Mol 1.0: tokenized drug design with large language models","authors":"Jike Wang, Rui Qin, Mingyang Wang, Meijing Fang, Yangyang Zhang, Yuchen Zhu, Qun Su, Qiaolin Gou, Chao Shen, Odin Zhang, Zhenxing Wu, Dejun Jiang, Xujun Zhang, Huifeng Zhao, Jingxuan Ge, Zhourui Wu, Yu Kang, Chang-Yu Hsieh, Tingjun Hou","doi":"10.1038/s41467-025-59628-y","DOIUrl":null,"url":null,"abstract":"<p>The integration of large language models (LLMs) into drug design is gaining momentum; however, existing approaches often struggle to effectively incorporate three-dimensional molecular structures. Here, we present Token-Mol, a token-only 3D drug design model that encodes both 2D and 3D structural information, along with molecular properties, into discrete tokens. Built on a transformer decoder and trained with causal masking, Token-Mol introduces a Gaussian cross-entropy loss function tailored for regression tasks, enabling superior performance across multiple downstream applications. The model surpasses existing methods, improving molecular conformation generation by over 10% and 20% across two datasets, while outperforming token-only models by 30% in property prediction. In pocket-based molecular generation, it enhances drug-likeness and synthetic accessibility by approximately 11% and 14%, respectively. Notably, Token-Mol operates 35 times faster than expert diffusion models. In real-world validation, it improves success rates and, when combined with reinforcement learning, further optimizes affinity and drug-likeness, advancing AI-driven drug discovery.</p>","PeriodicalId":19066,"journal":{"name":"Nature Communications","volume":"20 1","pages":""},"PeriodicalIF":14.7000,"publicationDate":"2025-05-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Nature Communications","FirstCategoryId":"103","ListUrlMain":"https://doi.org/10.1038/s41467-025-59628-y","RegionNum":1,"RegionCategory":"综合性期刊","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
Abstract
The integration of large language models (LLMs) into drug design is gaining momentum; however, existing approaches often struggle to effectively incorporate three-dimensional molecular structures. Here, we present Token-Mol, a token-only 3D drug design model that encodes both 2D and 3D structural information, along with molecular properties, into discrete tokens. Built on a transformer decoder and trained with causal masking, Token-Mol introduces a Gaussian cross-entropy loss function tailored for regression tasks, enabling superior performance across multiple downstream applications. The model surpasses existing methods, improving molecular conformation generation by over 10% and 20% across two datasets, while outperforming token-only models by 30% in property prediction. In pocket-based molecular generation, it enhances drug-likeness and synthetic accessibility by approximately 11% and 14%, respectively. Notably, Token-Mol operates 35 times faster than expert diffusion models. In real-world validation, it improves success rates and, when combined with reinforcement learning, further optimizes affinity and drug-likeness, advancing AI-driven drug discovery.
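The Gaussian cross-entropy loss highlighted in the abstract turns regression into classification over a vocabulary of discretized numeric tokens while still rewarding predictions that land near the true value: instead of a one-hot label on the correct bin, probability mass is spread over neighboring bins. The sketch below is a minimal PyTorch illustration of that idea, assuming targets are binned over fixed `bin_centers` and labels are softened by a Gaussian of width `sigma`; the function name, arguments, and exact smoothing form are illustrative assumptions, not the paper's released implementation.

```python
import torch
import torch.nn.functional as F

def gaussian_cross_entropy(logits, targets, bin_centers, sigma=0.5):
    """Cross-entropy against a Gaussian-smoothed target distribution.

    logits:      (batch, n_bins) scores over the discretized value vocabulary
    targets:     (batch,) continuous regression targets
    bin_centers: (n_bins,) center value of each numeric token
    sigma:       width of the Gaussian used to soften the labels

    Unlike one-hot cross-entropy, bins near the true value also receive
    probability mass, so a prediction that misses by one bin is penalized
    far less than one that is numerically distant.
    """
    # Squared distance of every bin center from each target value.
    sq_dist = (bin_centers.unsqueeze(0) - targets.unsqueeze(1)) ** 2
    # Gaussian-shaped soft label distribution, normalized over bins.
    soft_labels = torch.softmax(-sq_dist / (2.0 * sigma ** 2), dim=-1)
    log_probs = F.log_softmax(logits, dim=-1)
    return -(soft_labels * log_probs).sum(dim=-1).mean()

if __name__ == "__main__":
    bins = torch.linspace(-1.0, 1.0, steps=101)       # 101 numeric tokens
    logits = torch.randn(4, 101, requires_grad=True)  # mock model outputs
    y = torch.tensor([0.13, -0.42, 0.98, 0.0])        # continuous targets
    loss = gaussian_cross_entropy(logits, y, bins, sigma=0.05)
    loss.backward()
    print(float(loss))
```

As `sigma` shrinks toward zero, the soft labels collapse to one-hot vectors and the loss reduces to ordinary cross-entropy on the nearest bin. Likewise, the "token-only" 3D encoding can be pictured as rounding continuous geometric degrees of freedom, such as torsion angles, into integer tokens appended to the 2D chemical vocabulary; the bin width and id offset in this sketch are hypothetical choices, not the paper's actual token ids.

```python
def torsions_to_tokens(angles_deg, bin_width=1.0, offset=1000):
    """Map torsion angles in [-180, 180) to integer token ids.

    offset keeps the numeric tokens disjoint from the chemical
    vocabulary (an illustrative convention, not the paper's).
    """
    return [offset + int((a + 180.0) // bin_width) for a in angles_deg]
```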
About the journal:
Nature Communications, an open-access journal, publishes high-quality research spanning all areas of the natural sciences. Papers featured in the journal showcase significant advances relevant to specialists in their respective fields. With a 2-year impact factor of 16.6 (2022) and a median time of 8 days from submission to first editorial decision, Nature Communications is committed to the rapid dissemination of research findings. As a multidisciplinary journal, it welcomes contributions from the biological, health, physical, chemical, Earth, social, mathematical, applied, and engineering sciences, aiming to highlight important breakthroughs within each domain.