{"title":"Token-Mol 1.0:使用大型语言模型的标记化药物设计","authors":"Jike Wang, Rui Qin, Mingyang Wang, Meijing Fang, Yangyang Zhang, Yuchen Zhu, Qun Su, Qiaolin Gou, Chao Shen, Odin Zhang, Zhenxing Wu, Dejun Jiang, Xujun Zhang, Huifeng Zhao, Xiaozhe Wan, Zhourui Wu, Liwei Liu, Yu Kang, Chang-Yu Hsieh, Tingjun Hou","doi":"arxiv-2407.07930","DOIUrl":null,"url":null,"abstract":"Significant interests have recently risen in leveraging sequence-based large\nlanguage models (LLMs) for drug design. However, most current applications of\nLLMs in drug discovery lack the ability to comprehend three-dimensional (3D)\nstructures, thereby limiting their effectiveness in tasks that explicitly\ninvolve molecular conformations. In this study, we introduced Token-Mol, a\ntoken-only 3D drug design model. This model encodes all molecular information,\nincluding 2D and 3D structures, as well as molecular property data, into\ntokens, which transforms classification and regression tasks in drug discovery\ninto probabilistic prediction problems, thereby enabling learning through a\nunified paradigm. Token-Mol is built on the transformer decoder architecture\nand trained using random causal masking techniques. Additionally, we proposed\nthe Gaussian cross-entropy (GCE) loss function to overcome the challenges in\nregression tasks, significantly enhancing the capacity of LLMs to learn\ncontinuous numerical values. Through a combination of fine-tuning and\nreinforcement learning (RL), Token-Mol achieves performance comparable to or\nsurpassing existing task-specific methods across various downstream tasks,\nincluding pocket-based molecular generation, conformation generation, and\nmolecular property prediction. Compared to existing molecular pre-trained\nmodels, Token-Mol exhibits superior proficiency in handling a wider range of\ndownstream tasks essential for drug design. Notably, our approach improves\nregression task accuracy by approximately 30% compared to similar token-only\nmethods. Token-Mol overcomes the precision limitations of token-only models and\nhas the potential to integrate seamlessly with general models such as ChatGPT,\npaving the way for the development of a universal artificial intelligence drug\ndesign model that facilitates rapid and high-quality drug design by experts.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"57 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Token-Mol 1.0: Tokenized drug design with large language model\",\"authors\":\"Jike Wang, Rui Qin, Mingyang Wang, Meijing Fang, Yangyang Zhang, Yuchen Zhu, Qun Su, Qiaolin Gou, Chao Shen, Odin Zhang, Zhenxing Wu, Dejun Jiang, Xujun Zhang, Huifeng Zhao, Xiaozhe Wan, Zhourui Wu, Liwei Liu, Yu Kang, Chang-Yu Hsieh, Tingjun Hou\",\"doi\":\"arxiv-2407.07930\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Significant interests have recently risen in leveraging sequence-based large\\nlanguage models (LLMs) for drug design. However, most current applications of\\nLLMs in drug discovery lack the ability to comprehend three-dimensional (3D)\\nstructures, thereby limiting their effectiveness in tasks that explicitly\\ninvolve molecular conformations. In this study, we introduced Token-Mol, a\\ntoken-only 3D drug design model. 
This model encodes all molecular information,\\nincluding 2D and 3D structures, as well as molecular property data, into\\ntokens, which transforms classification and regression tasks in drug discovery\\ninto probabilistic prediction problems, thereby enabling learning through a\\nunified paradigm. Token-Mol is built on the transformer decoder architecture\\nand trained using random causal masking techniques. Additionally, we proposed\\nthe Gaussian cross-entropy (GCE) loss function to overcome the challenges in\\nregression tasks, significantly enhancing the capacity of LLMs to learn\\ncontinuous numerical values. Through a combination of fine-tuning and\\nreinforcement learning (RL), Token-Mol achieves performance comparable to or\\nsurpassing existing task-specific methods across various downstream tasks,\\nincluding pocket-based molecular generation, conformation generation, and\\nmolecular property prediction. Compared to existing molecular pre-trained\\nmodels, Token-Mol exhibits superior proficiency in handling a wider range of\\ndownstream tasks essential for drug design. Notably, our approach improves\\nregression task accuracy by approximately 30% compared to similar token-only\\nmethods. Token-Mol overcomes the precision limitations of token-only models and\\nhas the potential to integrate seamlessly with general models such as ChatGPT,\\npaving the way for the development of a universal artificial intelligence drug\\ndesign model that facilitates rapid and high-quality drug design by experts.\",\"PeriodicalId\":501022,\"journal\":{\"name\":\"arXiv - QuanBio - Biomolecules\",\"volume\":\"57 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - QuanBio - Biomolecules\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2407.07930\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Biomolecules","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.07930","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Token-Mol 1.0: Tokenized drug design with large language model
Significant interest has recently arisen in leveraging sequence-based large language models (LLMs) for drug design. However, most current applications of LLMs in drug discovery lack the ability to comprehend three-dimensional (3D) structures, thereby limiting their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduce Token-Mol, a token-only 3D drug design model. This model encodes all molecular information, including 2D and 3D structures as well as molecular property data, into tokens, which transforms classification and regression tasks in drug discovery into probabilistic prediction problems and thereby enables learning through a unified paradigm.
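As a rough, hypothetical illustration of this token-only idea (the tokenizer, bin widths, and token names below are illustrative assumptions, not the paper's actual vocabulary), continuous quantities such as torsion angles and property values can be discretized into tokens that share a single vocabulary with the 2D structure tokens:

# Hypothetical illustration of a token-only molecular representation:
# SMILES characters, discretized torsion angles, and discretized property
# values share one vocabulary, so every task reduces to next-token
# prediction. Bin widths and token names are illustrative only.

def tokenize_smiles(smiles: str) -> list[str]:
    # Character-level tokenization; real chemistry tokenizers merge
    # multi-character atoms such as "Cl" and "Br".
    return list(smiles)

def tokenize_value(value: float, low: float, high: float,
                   n_bins: int, prefix: str) -> str:
    # Map a continuous number onto one of n_bins discrete tokens.
    value = min(max(value, low), high)
    bin_idx = int((value - low) / (high - low) * (n_bins - 1))
    return f"<{prefix}_{bin_idx}>"

# 2D structure + one 3D torsion angle (degrees) + one property value
tokens = (
    tokenize_smiles("CCO")
    + [tokenize_value(-63.2, -180.0, 180.0, 360, "torsion")]
    + [tokenize_value(0.46, -5.0, 10.0, 600, "prop")]
)
print(tokens)  # ['C', 'C', 'O', '<torsion_116>', '<prop_218>']

Once everything is expressed as tokens, generation, classification, and regression all reduce to predicting a distribution over the next token.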
Token-Mol is built on the transformer decoder architecture and trained using random causal masking. Additionally, we propose the Gaussian cross-entropy (GCE) loss function to overcome the challenges of regression tasks, significantly enhancing the capacity of LLMs to learn continuous numerical values.
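The paper defines GCE precisely; the following is only a minimal sketch of the general soft-label idea, assuming the regression target has been discretized into ordered numeric bins and the usual one-hot target is replaced by a Gaussian centered on the true bin (the bin count and sigma are placeholder values):

# Minimal sketch of a Gaussian cross-entropy style loss, assuming the
# regression target is discretized into ordered numeric bins and the
# one-hot label is replaced by a Gaussian centered on the true bin.
# Illustrates the general soft-label idea, not the paper's exact
# formulation or hyperparameters.
import torch
import torch.nn.functional as F

def gaussian_cross_entropy(logits: torch.Tensor, target_bin: torch.Tensor,
                           n_bins: int, sigma: float = 2.0) -> torch.Tensor:
    # logits:     (batch, n_bins) scores over the numeric-token vocabulary
    # target_bin: (batch,) index of the bin containing the true value
    bins = torch.arange(n_bins, device=logits.device, dtype=logits.dtype)
    # Gaussian-smoothed soft labels, one row per sample.
    dist = bins.unsqueeze(0) - target_bin.unsqueeze(1).to(logits.dtype)
    soft_labels = torch.exp(-0.5 * (dist / sigma) ** 2)
    soft_labels = soft_labels / soft_labels.sum(dim=1, keepdim=True)
    # Cross-entropy between the soft labels and the predicted distribution.
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_labels * log_probs).sum(dim=1).mean()

# Usage: 256 numeric bins, batch of 4 samples.
loss = gaussian_cross_entropy(torch.randn(4, 256),
                              torch.tensor([10, 100, 200, 42]), 256)

Because neighboring bins receive probability mass, predictions that are numerically close to the true value are penalized less than distant ones, which is what allows a token-based model to learn continuous quantities.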
Through a combination of fine-tuning and reinforcement learning (RL), Token-Mol achieves performance comparable to or surpassing existing task-specific methods across various downstream tasks, including pocket-based molecular generation, conformation generation, and molecular property prediction.
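The specific RL algorithm and reward used by Token-Mol are described in the paper; the sketch below only illustrates a generic REINFORCE-style generate-score-update loop of the kind commonly used for molecular generators (model.sample and reward_fn are hypothetical placeholders, not Token-Mol's API):

# Hedged sketch of a REINFORCE-style fine-tuning step: sample token
# sequences from the generator, score each with a task reward (e.g. a
# docking or property oracle), and increase the log-likelihood of
# high-reward samples. Generic illustration only.
import torch

def rl_step(model, tokenizer, reward_fn, optimizer,
            batch_size: int = 16, max_len: int = 128) -> float:
    model.train()
    # Hypothetical sampling API returning sequences and per-token log-probs
    # of shape (batch, max_len).
    sequences, log_probs = model.sample(batch_size=batch_size, max_len=max_len)
    rewards = torch.tensor([reward_fn(tokenizer.decode(s)) for s in sequences],
                           device=log_probs.device)
    # Baseline-subtracted policy gradient on the sequence log-likelihoods.
    advantage = rewards - rewards.mean()
    loss = -(advantage * log_probs.sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()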
Compared with existing molecular pre-trained models, Token-Mol handles a wider range of the downstream tasks essential for drug design. Notably, our approach improves regression accuracy by approximately 30% over comparable token-only methods. Token-Mol thus overcomes the precision limitations of token-only models and has the potential to integrate seamlessly with general-purpose models such as ChatGPT, paving the way for a universal artificial intelligence drug design model that enables rapid, high-quality drug design by experts.