{"title":"Token-Mol 1.0:使用大型语言模型的标记化药物设计","authors":"Jike Wang, Rui Qin, Mingyang Wang, Meijing Fang, Yangyang Zhang, Yuchen Zhu, Qun Su, Qiaolin Gou, Chao Shen, Odin Zhang, Zhenxing Wu, Dejun Jiang, Xujun Zhang, Huifeng Zhao, Xiaozhe Wan, Zhourui Wu, Liwei Liu, Yu Kang, Chang-Yu Hsieh, Tingjun Hou","doi":"arxiv-2407.07930","DOIUrl":null,"url":null,"abstract":"Significant interests have recently risen in leveraging sequence-based large\nlanguage models (LLMs) for drug design. However, most current applications of\nLLMs in drug discovery lack the ability to comprehend three-dimensional (3D)\nstructures, thereby limiting their effectiveness in tasks that explicitly\ninvolve molecular conformations. In this study, we introduced Token-Mol, a\ntoken-only 3D drug design model. This model encodes all molecular information,\nincluding 2D and 3D structures, as well as molecular property data, into\ntokens, which transforms classification and regression tasks in drug discovery\ninto probabilistic prediction problems, thereby enabling learning through a\nunified paradigm. Token-Mol is built on the transformer decoder architecture\nand trained using random causal masking techniques. Additionally, we proposed\nthe Gaussian cross-entropy (GCE) loss function to overcome the challenges in\nregression tasks, significantly enhancing the capacity of LLMs to learn\ncontinuous numerical values. Through a combination of fine-tuning and\nreinforcement learning (RL), Token-Mol achieves performance comparable to or\nsurpassing existing task-specific methods across various downstream tasks,\nincluding pocket-based molecular generation, conformation generation, and\nmolecular property prediction. Compared to existing molecular pre-trained\nmodels, Token-Mol exhibits superior proficiency in handling a wider range of\ndownstream tasks essential for drug design. Notably, our approach improves\nregression task accuracy by approximately 30% compared to similar token-only\nmethods. Token-Mol overcomes the precision limitations of token-only models and\nhas the potential to integrate seamlessly with general models such as ChatGPT,\npaving the way for the development of a universal artificial intelligence drug\ndesign model that facilitates rapid and high-quality drug design by experts.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"57 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Token-Mol 1.0: Tokenized drug design with large language model\",\"authors\":\"Jike Wang, Rui Qin, Mingyang Wang, Meijing Fang, Yangyang Zhang, Yuchen Zhu, Qun Su, Qiaolin Gou, Chao Shen, Odin Zhang, Zhenxing Wu, Dejun Jiang, Xujun Zhang, Huifeng Zhao, Xiaozhe Wan, Zhourui Wu, Liwei Liu, Yu Kang, Chang-Yu Hsieh, Tingjun Hou\",\"doi\":\"arxiv-2407.07930\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Significant interests have recently risen in leveraging sequence-based large\\nlanguage models (LLMs) for drug design. However, most current applications of\\nLLMs in drug discovery lack the ability to comprehend three-dimensional (3D)\\nstructures, thereby limiting their effectiveness in tasks that explicitly\\ninvolve molecular conformations. In this study, we introduced Token-Mol, a\\ntoken-only 3D drug design model. 
This model encodes all molecular information,\\nincluding 2D and 3D structures, as well as molecular property data, into\\ntokens, which transforms classification and regression tasks in drug discovery\\ninto probabilistic prediction problems, thereby enabling learning through a\\nunified paradigm. Token-Mol is built on the transformer decoder architecture\\nand trained using random causal masking techniques. Additionally, we proposed\\nthe Gaussian cross-entropy (GCE) loss function to overcome the challenges in\\nregression tasks, significantly enhancing the capacity of LLMs to learn\\ncontinuous numerical values. Through a combination of fine-tuning and\\nreinforcement learning (RL), Token-Mol achieves performance comparable to or\\nsurpassing existing task-specific methods across various downstream tasks,\\nincluding pocket-based molecular generation, conformation generation, and\\nmolecular property prediction. Compared to existing molecular pre-trained\\nmodels, Token-Mol exhibits superior proficiency in handling a wider range of\\ndownstream tasks essential for drug design. Notably, our approach improves\\nregression task accuracy by approximately 30% compared to similar token-only\\nmethods. Token-Mol overcomes the precision limitations of token-only models and\\nhas the potential to integrate seamlessly with general models such as ChatGPT,\\npaving the way for the development of a universal artificial intelligence drug\\ndesign model that facilitates rapid and high-quality drug design by experts.\",\"PeriodicalId\":501022,\"journal\":{\"name\":\"arXiv - QuanBio - Biomolecules\",\"volume\":\"57 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-10\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - QuanBio - Biomolecules\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2407.07930\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Biomolecules","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.07930","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Token-Mol 1.0: Tokenized drug design with large language model
Significant interest has recently arisen in leveraging sequence-based large language models (LLMs) for drug design. However, most current applications of LLMs in drug discovery lack the ability to comprehend three-dimensional (3D) structures, thereby limiting their effectiveness in tasks that explicitly involve molecular conformations. In this study, we introduce Token-Mol, a token-only 3D drug design model. This model encodes all molecular information, including 2D and 3D structures as well as molecular property data, into tokens, which transforms classification and regression tasks in drug discovery into probabilistic prediction problems and thereby enables learning through a unified paradigm.
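As a rough, hypothetical illustration of this token-only idea (the tokenizer, bin widths, and token names below are illustrative assumptions, not the paper's actual vocabulary), continuous quantities such as torsion angles and property values can be discretized into tokens that share a single vocabulary with the 2D structure tokens:

# Hypothetical illustration of a token-only molecular representation:
# SMILES characters, discretized torsion angles, and discretized property
# values share one vocabulary, so every task reduces to next-token
# prediction. Bin widths and token names are illustrative only.

def tokenize_smiles(smiles: str) -> list[str]:
    # Character-level tokenization; real chemistry tokenizers merge
    # multi-character atoms such as "Cl" and "Br".
    return list(smiles)

def tokenize_value(value: float, low: float, high: float,
                   n_bins: int, prefix: str) -> str:
    # Map a continuous number onto one of n_bins discrete tokens.
    value = min(max(value, low), high)
    bin_idx = int((value - low) / (high - low) * (n_bins - 1))
    return f"<{prefix}_{bin_idx}>"

# 2D structure + one 3D torsion angle (degrees) + one property value
tokens = (
    tokenize_smiles("CCO")
    + [tokenize_value(-63.2, -180.0, 180.0, 360, "torsion")]
    + [tokenize_value(0.46, -5.0, 10.0, 600, "prop")]
)
print(tokens)  # ['C', 'C', 'O', '<torsion_116>', '<prop_218>']

Once everything is expressed as tokens, generation, classification, and regression all reduce to predicting a distribution over the next token.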
Token-Mol is built on the transformer decoder architecture and trained using random causal masking. Additionally, we propose the Gaussian cross-entropy (GCE) loss function to overcome the challenges of regression tasks, significantly enhancing the capacity of LLMs to learn continuous numerical values.
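The paper defines GCE precisely; the following is only a minimal sketch of the general soft-label idea, assuming the regression target has been discretized into ordered numeric bins and the usual one-hot target is replaced by a Gaussian centered on the true bin (the bin count and sigma are placeholder values):

# Minimal sketch of a Gaussian cross-entropy style loss, assuming the
# regression target is discretized into ordered numeric bins and the
# one-hot label is replaced by a Gaussian centered on the true bin.
# Illustrates the general soft-label idea, not the paper's exact
# formulation or hyperparameters.
import torch
import torch.nn.functional as F

def gaussian_cross_entropy(logits: torch.Tensor, target_bin: torch.Tensor,
                           n_bins: int, sigma: float = 2.0) -> torch.Tensor:
    # logits:     (batch, n_bins) scores over the numeric-token vocabulary
    # target_bin: (batch,) index of the bin containing the true value
    bins = torch.arange(n_bins, device=logits.device, dtype=logits.dtype)
    # Gaussian-smoothed soft labels, one row per sample.
    dist = bins.unsqueeze(0) - target_bin.unsqueeze(1).to(logits.dtype)
    soft_labels = torch.exp(-0.5 * (dist / sigma) ** 2)
    soft_labels = soft_labels / soft_labels.sum(dim=1, keepdim=True)
    # Cross-entropy between the soft labels and the predicted distribution.
    log_probs = F.log_softmax(logits, dim=1)
    return -(soft_labels * log_probs).sum(dim=1).mean()

# Usage: 256 numeric bins, batch of 4 samples.
loss = gaussian_cross_entropy(torch.randn(4, 256),
                              torch.tensor([10, 100, 200, 42]), 256)

Because neighboring bins receive probability mass, predictions that are numerically close to the true value are penalized less than distant ones, which is what allows a token-based model to learn continuous quantities.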
Through a combination of fine-tuning and reinforcement learning (RL), Token-Mol achieves performance comparable to or surpassing existing task-specific methods across various downstream tasks, including pocket-based molecular generation, conformation generation, and molecular property prediction.
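The specific RL algorithm and reward used by Token-Mol are described in the paper; the sketch below only illustrates a generic REINFORCE-style generate-score-update loop of the kind commonly used for molecular generators (model.sample and reward_fn are hypothetical placeholders, not Token-Mol's API):

# Hedged sketch of a REINFORCE-style fine-tuning step: sample token
# sequences from the generator, score each with a task reward (e.g. a
# docking or property oracle), and increase the log-likelihood of
# high-reward samples. Generic illustration only.
import torch

def rl_step(model, tokenizer, reward_fn, optimizer,
            batch_size: int = 16, max_len: int = 128) -> float:
    model.train()
    # Hypothetical sampling API returning sequences and per-token log-probs
    # of shape (batch, max_len).
    sequences, log_probs = model.sample(batch_size=batch_size, max_len=max_len)
    rewards = torch.tensor([reward_fn(tokenizer.decode(s)) for s in sequences],
                           device=log_probs.device)
    # Baseline-subtracted policy gradient on the sequence log-likelihoods.
    advantage = rewards - rewards.mean()
    loss = -(advantage * log_probs.sum(dim=1)).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return rewards.mean().item()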
Compared with existing molecular pre-trained models, Token-Mol handles a wider range of the downstream tasks essential for drug design. Notably, our approach improves regression accuracy by approximately 30% over comparable token-only methods. Token-Mol thus overcomes the precision limitations of token-only models and has the potential to integrate seamlessly with general-purpose models such as ChatGPT, paving the way for a universal artificial intelligence drug design model that enables rapid, high-quality drug design by experts.