Token-Mol 1.0: Tokenized drug design with large language model

Jike Wang, Rui Qin, Mingyang Wang, Meijing Fang, Yangyang Zhang, Yuchen Zhu, Qun Su, Qiaolin Gou, Chao Shen, Odin Zhang, Zhenxing Wu, Dejun Jiang, Xujun Zhang, Huifeng Zhao, Xiaozhe Wan, Zhourui Wu, Liwei Liu, Yu Kang, Chang-Yu Hsieh, Tingjun Hou
Source: arXiv - QuanBio - Biomolecules
Published: 2024-07-10
DOI: arxiv-2407.07930 (https://doi.org/arxiv-2407.07930)
Citations: 0

Abstract

Significant interest has recently arisen in leveraging sequence-based large language models (LLMs) for drug design. However, most current applications of LLMs in drug discovery cannot comprehend three-dimensional (3D) structures, which limits their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduce Token-Mol, a token-only 3D drug design model. The model encodes all molecular information, including 2D and 3D structures as well as molecular property data, into tokens. This transforms classification and regression tasks in drug discovery into probabilistic prediction problems and enables learning through a unified paradigm. Token-Mol is built on the transformer decoder architecture and trained using random causal masking techniques. Additionally, we propose the Gaussian cross-entropy (GCE) loss function to overcome the challenges in regression tasks, significantly enhancing the capacity of LLMs to learn continuous numerical values. Through a combination of fine-tuning and reinforcement learning (RL), Token-Mol achieves performance comparable to or surpassing existing task-specific methods across various downstream tasks, including pocket-based molecular generation, conformation generation, and molecular property prediction. Compared to existing molecular pre-trained models, Token-Mol handles a wider range of downstream tasks essential for drug design. Notably, our approach improves regression task accuracy by approximately 30% compared to similar token-only methods. Token-Mol overcomes the precision limitations of token-only models and has the potential to integrate seamlessly with general models such as ChatGPT, paving the way for the development of a universal artificial intelligence drug design model that facilitates rapid and high-quality drug design by experts.
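
The abstract does not spell out the Gaussian cross-entropy (GCE) loss, so the following is only a minimal sketch of one plausible reading: a continuous value is discretized into numeric token bins, and the one-hot target is replaced by a Gaussian-weighted soft label centered on the true bin, so that near misses cost less than far misses. The function names, the bin count, and the `sigma` parameter are all illustrative assumptions and are not taken from the paper.

```python
import math

def gaussian_soft_labels(target_bin, num_bins, sigma=1.0):
    """Soft target over bins, peaked at target_bin (assumed form, not the paper's)."""
    weights = [math.exp(-((i - target_bin) ** 2) / (2 * sigma ** 2))
               for i in range(num_bins)]
    total = sum(weights)
    return [w / total for w in weights]

def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]

def hard_cross_entropy(logits, target_bin):
    """Standard cross-entropy against a one-hot target."""
    return -log_softmax(logits)[target_bin]

def gaussian_cross_entropy(logits, target_bin, sigma=1.0):
    """Cross-entropy against the Gaussian soft label instead of a one-hot target."""
    soft = gaussian_soft_labels(target_bin, len(logits), sigma)
    logp = log_softmax(logits)
    return -sum(q * lp for q, lp in zip(soft, logp))

# Hypothetical setup: 10 numeric bins, true value falls in bin 5.
num_bins, target = 10, 5
near = [0.0] * num_bins
near[6] = 5.0   # model confidently predicts the adjacent bin
far = [0.0] * num_bins
far[0] = 5.0    # model confidently predicts a distant bin

# One-hot cross-entropy cannot tell these two mistakes apart,
# while the Gaussian-weighted variant penalizes the distant one more.
print(hard_cross_entropy(near, target), hard_cross_entropy(far, target))
print(gaussian_cross_entropy(near, target), gaussian_cross_entropy(far, target))
```

Under this assumed formulation, the loss carries ordinal information about numeric tokens that a plain one-hot cross-entropy discards, which is one way a token-only model could learn continuous values more accurately.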