{"title":"整合大型语言模型和特定领域小型模型的分子图表示学习","authors":"Tianyu Zhang, Yuxiang Ren, Chengbin Hou, Hairong Lv, Xuegong Zhang","doi":"arxiv-2408.10124","DOIUrl":null,"url":null,"abstract":"Molecular property prediction is a crucial foundation for drug discovery. In\nrecent years, pre-trained deep learning models have been widely applied to this\ntask. Some approaches that incorporate prior biological domain knowledge into\nthe pre-training framework have achieved impressive results. However, these\nmethods heavily rely on biochemical experts, and retrieving and summarizing\nvast amounts of domain knowledge literature is both time-consuming and\nexpensive. Large Language Models (LLMs) have demonstrated remarkable\nperformance in understanding and efficiently providing general knowledge.\nNevertheless, they occasionally exhibit hallucinations and lack precision in\ngenerating domain-specific knowledge. Conversely, Domain-specific Small Models\n(DSMs) possess rich domain knowledge and can accurately calculate molecular\ndomain-related metrics. However, due to their limited model size and singular\nfunctionality, they lack the breadth of knowledge necessary for comprehensive\nrepresentation learning. To leverage the advantages of both approaches in\nmolecular property prediction, we propose a novel Molecular Graph\nrepresentation learning framework that integrates Large language models and\nDomain-specific small models (MolGraph-LarDo). Technically, we design a\ntwo-stage prompt strategy where DSMs are introduced to calibrate the knowledge\nprovided by LLMs, enhancing the accuracy of domain-specific information and\nthus enabling LLMs to generate more precise textual descriptions for molecular\nsamples. Subsequently, we employ a multi-modal alignment method to coordinate\nvarious modalities, including molecular graphs and their corresponding\ndescriptive texts, to guide the pre-training of molecular representations.\nExtensive experiments demonstrate the effectiveness of the proposed method.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Molecular Graph Representation Learning Integrating Large Language Models with Domain-specific Small Models\",\"authors\":\"Tianyu Zhang, Yuxiang Ren, Chengbin Hou, Hairong Lv, Xuegong Zhang\",\"doi\":\"arxiv-2408.10124\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Molecular property prediction is a crucial foundation for drug discovery. In\\nrecent years, pre-trained deep learning models have been widely applied to this\\ntask. Some approaches that incorporate prior biological domain knowledge into\\nthe pre-training framework have achieved impressive results. However, these\\nmethods heavily rely on biochemical experts, and retrieving and summarizing\\nvast amounts of domain knowledge literature is both time-consuming and\\nexpensive. Large Language Models (LLMs) have demonstrated remarkable\\nperformance in understanding and efficiently providing general knowledge.\\nNevertheless, they occasionally exhibit hallucinations and lack precision in\\ngenerating domain-specific knowledge. Conversely, Domain-specific Small Models\\n(DSMs) possess rich domain knowledge and can accurately calculate molecular\\ndomain-related metrics. 
However, due to their limited model size and singular\\nfunctionality, they lack the breadth of knowledge necessary for comprehensive\\nrepresentation learning. To leverage the advantages of both approaches in\\nmolecular property prediction, we propose a novel Molecular Graph\\nrepresentation learning framework that integrates Large language models and\\nDomain-specific small models (MolGraph-LarDo). Technically, we design a\\ntwo-stage prompt strategy where DSMs are introduced to calibrate the knowledge\\nprovided by LLMs, enhancing the accuracy of domain-specific information and\\nthus enabling LLMs to generate more precise textual descriptions for molecular\\nsamples. Subsequently, we employ a multi-modal alignment method to coordinate\\nvarious modalities, including molecular graphs and their corresponding\\ndescriptive texts, to guide the pre-training of molecular representations.\\nExtensive experiments demonstrate the effectiveness of the proposed method.\",\"PeriodicalId\":501022,\"journal\":{\"name\":\"arXiv - QuanBio - Biomolecules\",\"volume\":\"19 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - QuanBio - Biomolecules\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.10124\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Biomolecules","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.10124","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Molecular Graph Representation Learning Integrating Large Language Models with Domain-specific Small Models
Molecular property prediction is a crucial foundation for drug discovery. In
recent years, pre-trained deep learning models have been widely applied to this
task. Some approaches that incorporate prior biological domain knowledge into
the pre-training framework have achieved impressive results. However, these
methods rely heavily on biochemical experts, and retrieving and summarizing
vast amounts of domain-knowledge literature is both time-consuming and
expensive. Large Language Models (LLMs) have demonstrated a remarkable
ability to understand and efficiently provide general knowledge.
Nevertheless, they occasionally exhibit hallucinations and lack precision in
generating domain-specific knowledge. Conversely, Domain-specific Small Models
(DSMs) possess rich domain knowledge and can accurately calculate molecular
domain-related metrics. However, due to their limited model size and narrow
functionality, they lack the breadth of knowledge necessary for comprehensive
representation learning. To leverage the advantages of both approaches in
molecular property prediction, we propose a novel Molecular Graph
representation learning framework that integrates Large language models and
Domain-specific small models (MolGraph-LarDo). Technically, we design a
two-stage prompt strategy where DSMs are introduced to calibrate the knowledge
provided by LLMs, enhancing the accuracy of domain-specific information and
thus enabling LLMs to generate more precise textual descriptions for molecular
samples. Subsequently, we employ a multi-modal alignment method to coordinate
various modalities, including molecular graphs and their corresponding
descriptive texts, to guide the pre-training of molecular representations.
Extensive experiments demonstrate the effectiveness of the proposed method.
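Below is a minimal Python sketch of how such a two-stage prompt strategy could
look. The abstract does not give the paper's actual prompts or DSM choices;
here call_llm is a hypothetical stand-in for any LLM completion API, and RDKit
descriptors play the role of the domain-specific small model that supplies
exact metrics.

from rdkit import Chem
from rdkit.Chem import Descriptors

def call_llm(prompt: str) -> str:
    # Hypothetical LLM client; substitute your provider's API call.
    raise NotImplementedError

def describe_molecule(smiles: str) -> str:
    # Stage 1: ask the LLM for a draft description of the molecule.
    draft = call_llm(f"Describe the molecule with SMILES {smiles}.")

    # DSM step: compute exact domain metrics that LLMs tend to hallucinate.
    mol = Chem.MolFromSmiles(smiles)
    metrics = {
        "molecular_weight": round(Descriptors.MolWt(mol), 2),
        "logP": round(Descriptors.MolLogP(mol), 2),
    }

    # Stage 2: feed the DSM-computed values back so the LLM calibrates its
    # draft into a description consistent with precise domain metrics.
    return call_llm(
        "Revise this description so it agrees with the measured "
        f"properties {metrics}:\n{draft}"
    )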
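The abstract likewise leaves the multi-modal alignment objective unspecified;
a common choice for coordinating paired modalities is a symmetric contrastive
(InfoNCE) loss between graph and text embeddings, sketched below under that
assumption.

import torch
import torch.nn.functional as F

def alignment_loss(graph_emb: torch.Tensor,
                   text_emb: torch.Tensor,
                   temperature: float = 0.1) -> torch.Tensor:
    # Symmetric contrastive loss over a batch of paired (graph, text) samples.
    g = F.normalize(graph_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = g @ t.T / temperature      # scaled cosine similarities
    labels = torch.arange(g.size(0))    # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

# Toy usage with random stand-in embeddings (hypothetical dimension 256):
graph_batch = torch.randn(8, 256)
text_batch = torch.randn(8, 256)
loss = alignment_loss(graph_batch, text_batch)

In practice the graph embeddings would come from a GNN encoder over the
molecular graph and the text embeddings from a language-model encoder over the
LLM-generated description; the loss pulls matched pairs together and pushes
mismatched pairs apart within each batch.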