{"title":"整合大型语言模型和特定领域小型模型的分子图表示学习","authors":"Tianyu Zhang, Yuxiang Ren, Chengbin Hou, Hairong Lv, Xuegong Zhang","doi":"arxiv-2408.10124","DOIUrl":null,"url":null,"abstract":"Molecular property prediction is a crucial foundation for drug discovery. In\nrecent years, pre-trained deep learning models have been widely applied to this\ntask. Some approaches that incorporate prior biological domain knowledge into\nthe pre-training framework have achieved impressive results. However, these\nmethods heavily rely on biochemical experts, and retrieving and summarizing\nvast amounts of domain knowledge literature is both time-consuming and\nexpensive. Large Language Models (LLMs) have demonstrated remarkable\nperformance in understanding and efficiently providing general knowledge.\nNevertheless, they occasionally exhibit hallucinations and lack precision in\ngenerating domain-specific knowledge. Conversely, Domain-specific Small Models\n(DSMs) possess rich domain knowledge and can accurately calculate molecular\ndomain-related metrics. However, due to their limited model size and singular\nfunctionality, they lack the breadth of knowledge necessary for comprehensive\nrepresentation learning. To leverage the advantages of both approaches in\nmolecular property prediction, we propose a novel Molecular Graph\nrepresentation learning framework that integrates Large language models and\nDomain-specific small models (MolGraph-LarDo). Technically, we design a\ntwo-stage prompt strategy where DSMs are introduced to calibrate the knowledge\nprovided by LLMs, enhancing the accuracy of domain-specific information and\nthus enabling LLMs to generate more precise textual descriptions for molecular\nsamples. Subsequently, we employ a multi-modal alignment method to coordinate\nvarious modalities, including molecular graphs and their corresponding\ndescriptive texts, to guide the pre-training of molecular representations.\nExtensive experiments demonstrate the effectiveness of the proposed method.","PeriodicalId":501022,"journal":{"name":"arXiv - QuanBio - Biomolecules","volume":"19 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Molecular Graph Representation Learning Integrating Large Language Models with Domain-specific Small Models\",\"authors\":\"Tianyu Zhang, Yuxiang Ren, Chengbin Hou, Hairong Lv, Xuegong Zhang\",\"doi\":\"arxiv-2408.10124\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Molecular property prediction is a crucial foundation for drug discovery. In\\nrecent years, pre-trained deep learning models have been widely applied to this\\ntask. Some approaches that incorporate prior biological domain knowledge into\\nthe pre-training framework have achieved impressive results. However, these\\nmethods heavily rely on biochemical experts, and retrieving and summarizing\\nvast amounts of domain knowledge literature is both time-consuming and\\nexpensive. Large Language Models (LLMs) have demonstrated remarkable\\nperformance in understanding and efficiently providing general knowledge.\\nNevertheless, they occasionally exhibit hallucinations and lack precision in\\ngenerating domain-specific knowledge. Conversely, Domain-specific Small Models\\n(DSMs) possess rich domain knowledge and can accurately calculate molecular\\ndomain-related metrics. 
However, due to their limited model size and singular\\nfunctionality, they lack the breadth of knowledge necessary for comprehensive\\nrepresentation learning. To leverage the advantages of both approaches in\\nmolecular property prediction, we propose a novel Molecular Graph\\nrepresentation learning framework that integrates Large language models and\\nDomain-specific small models (MolGraph-LarDo). Technically, we design a\\ntwo-stage prompt strategy where DSMs are introduced to calibrate the knowledge\\nprovided by LLMs, enhancing the accuracy of domain-specific information and\\nthus enabling LLMs to generate more precise textual descriptions for molecular\\nsamples. Subsequently, we employ a multi-modal alignment method to coordinate\\nvarious modalities, including molecular graphs and their corresponding\\ndescriptive texts, to guide the pre-training of molecular representations.\\nExtensive experiments demonstrate the effectiveness of the proposed method.\",\"PeriodicalId\":501022,\"journal\":{\"name\":\"arXiv - QuanBio - Biomolecules\",\"volume\":\"19 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-08-19\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - QuanBio - Biomolecules\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2408.10124\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - QuanBio - Biomolecules","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2408.10124","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Molecular Graph Representation Learning Integrating Large Language Models with Domain-specific Small Models
Molecular property prediction is a crucial foundation for drug discovery. In
recent years, pre-trained deep learning models have been widely applied to this
task. Some approaches that incorporate prior biological domain knowledge into
the pre-training framework have achieved impressive results. However, these
methods rely heavily on biochemical experts, and retrieving and summarizing
vast amounts of domain-knowledge literature is both time-consuming and
expensive. Large Language Models (LLMs) have demonstrated a remarkable
ability to understand and efficiently provide general knowledge.
Nevertheless, they occasionally exhibit hallucinations and lack precision in
generating domain-specific knowledge. Conversely, Domain-specific Small Models
(DSMs) possess rich domain knowledge and can accurately calculate molecular
domain-related metrics. However, due to their limited model size and narrow
functionality, they lack the breadth of knowledge necessary for comprehensive
representation learning. To leverage the advantages of both approaches in
molecular property prediction, we propose a novel Molecular Graph
representation learning framework that integrates Large language models and
Domain-specific small models (MolGraph-LarDo). Technically, we design a
two-stage prompt strategy where DSMs are introduced to calibrate the knowledge
provided by LLMs, enhancing the accuracy of domain-specific information and
thus enabling LLMs to generate more precise textual descriptions for molecular
samples. Subsequently, we employ a multi-modal alignment method to coordinate
various modalities, including molecular graphs and their corresponding
descriptive texts, to guide the pre-training of molecular representations.
Extensive experiments demonstrate the effectiveness of the proposed method.
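Below is a minimal Python sketch of how such a two-stage prompt strategy could
look. The abstract does not give the paper's actual prompts or DSM choices;
here call_llm is a hypothetical stand-in for any LLM completion API, and RDKit
descriptors play the role of the domain-specific small model that supplies
exact metrics.

from rdkit import Chem
from rdkit.Chem import Descriptors

def call_llm(prompt: str) -> str:
    # Hypothetical LLM client; substitute your provider's API call.
    raise NotImplementedError

def describe_molecule(smiles: str) -> str:
    # Stage 1: ask the LLM for a draft description of the molecule.
    draft = call_llm(f"Describe the molecule with SMILES {smiles}.")

    # DSM step: compute exact domain metrics that LLMs tend to hallucinate.
    mol = Chem.MolFromSmiles(smiles)
    metrics = {
        "molecular_weight": round(Descriptors.MolWt(mol), 2),
        "logP": round(Descriptors.MolLogP(mol), 2),
    }

    # Stage 2: feed the DSM-computed values back so the LLM calibrates its
    # draft into a description consistent with precise domain metrics.
    return call_llm(
        "Revise this description so it agrees with the measured "
        f"properties {metrics}:\n{draft}"
    )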
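The abstract likewise leaves the multi-modal alignment objective unspecified;
a common choice for coordinating paired modalities is a symmetric contrastive
(InfoNCE) loss between graph and text embeddings, sketched below under that
assumption.

import torch
import torch.nn.functional as F

def alignment_loss(graph_emb: torch.Tensor,
                   text_emb: torch.Tensor,
                   temperature: float = 0.1) -> torch.Tensor:
    # Symmetric contrastive loss over a batch of paired (graph, text) samples.
    g = F.normalize(graph_emb, dim=-1)
    t = F.normalize(text_emb, dim=-1)
    logits = g @ t.T / temperature      # scaled cosine similarities
    labels = torch.arange(g.size(0))    # matched pairs lie on the diagonal
    return (F.cross_entropy(logits, labels)
            + F.cross_entropy(logits.T, labels)) / 2

# Toy usage with random stand-in embeddings (hypothetical dimension 256):
graph_batch = torch.randn(8, 256)
text_batch = torch.randn(8, 256)
loss = alignment_loss(graph_batch, text_batch)

In practice the graph embeddings would come from a GNN encoder over the
molecular graph and the text embeddings from a language-model encoder over the
LLM-generated description; the loss pulls matched pairs together and pushes
mismatched pairs apart within each batch.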