Jiatong Li;Wei Liu;Zhihao Ding;Wenqi Fan;Yuqiang Li;Qing Li
{"title":"大型语言模型是上下文中的分子学习者","authors":"Jiatong Li;Wei Liu;Zhihao Ding;Wenqi Fan;Yuqiang Li;Qing Li","doi":"10.1109/TKDE.2025.3557697","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) have demonstrated exceptional performance in biochemical tasks, especially the molecule caption translation task, which aims to bridge the gap between molecules and natural language texts. However, previous methods in adapting LLMs to the molecule-caption translation task required extra domain-specific pre-training stages, suffered weak alignment between molecular and textual spaces, or imposed stringent demands on the scale of LLMs. To resolve the challenges, we propose <bold>I</b>n-<bold>C</b>ontext <bold>M</b>olecule <bold>A</b>daptation (<bold>ICMA</b>), as a new paradigm allowing LLMs to learn the molecule-text alignment from context examples via In-Context Molecule Tuning. Specifically, ICMA incorporates the following three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-context Molecule Tuning. Initially, Hybrid Context Retrieval utilizes BM25 Caption Retrieval and Molecule Graph Retrieval to retrieve similar informative context examples. Additionally, Post-retrieval Re-ranking is composed of Sequence Reversal and Random Walk selection to further improve the quality of retrieval results. Finally, In-Context Molecule Tuning unlocks the in-context learning and reasoning capability of LLMs with the retrieved examples and adapts the parameters of LLMs for better alignment between molecules and texts. Experimental results demonstrate that ICMA can empower LLMs to achieve state-of-the-art or comparable performance without extra training corpora and intricate structures, showing that LLMs are inherently in-context molecule learners.","PeriodicalId":13496,"journal":{"name":"IEEE Transactions on Knowledge and Data Engineering","volume":"37 7","pages":"4131-4143"},"PeriodicalIF":10.4000,"publicationDate":"2025-04-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Large Language Models are in-Context Molecule Learners\",\"authors\":\"Jiatong Li;Wei Liu;Zhihao Ding;Wenqi Fan;Yuqiang Li;Qing Li\",\"doi\":\"10.1109/TKDE.2025.3557697\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Large Language Models (LLMs) have demonstrated exceptional performance in biochemical tasks, especially the molecule caption translation task, which aims to bridge the gap between molecules and natural language texts. However, previous methods in adapting LLMs to the molecule-caption translation task required extra domain-specific pre-training stages, suffered weak alignment between molecular and textual spaces, or imposed stringent demands on the scale of LLMs. To resolve the challenges, we propose <bold>I</b>n-<bold>C</b>ontext <bold>M</b>olecule <bold>A</b>daptation (<bold>ICMA</b>), as a new paradigm allowing LLMs to learn the molecule-text alignment from context examples via In-Context Molecule Tuning. Specifically, ICMA incorporates the following three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-context Molecule Tuning. Initially, Hybrid Context Retrieval utilizes BM25 Caption Retrieval and Molecule Graph Retrieval to retrieve similar informative context examples. Additionally, Post-retrieval Re-ranking is composed of Sequence Reversal and Random Walk selection to further improve the quality of retrieval results. 
Finally, In-Context Molecule Tuning unlocks the in-context learning and reasoning capability of LLMs with the retrieved examples and adapts the parameters of LLMs for better alignment between molecules and texts. Experimental results demonstrate that ICMA can empower LLMs to achieve state-of-the-art or comparable performance without extra training corpora and intricate structures, showing that LLMs are inherently in-context molecule learners.\",\"PeriodicalId\":13496,\"journal\":{\"name\":\"IEEE Transactions on Knowledge and Data Engineering\",\"volume\":\"37 7\",\"pages\":\"4131-4143\"},\"PeriodicalIF\":10.4000,\"publicationDate\":\"2025-04-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE Transactions on Knowledge and Data Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10948482/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Knowledge and Data Engineering","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10948482/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Large Language Models are in-Context Molecule Learners
Large Language Models (LLMs) have demonstrated exceptional performance in biochemical tasks, especially the molecule-caption translation task, which aims to bridge the gap between molecules and natural language texts. However, previous methods for adapting LLMs to the molecule-caption translation task either required extra domain-specific pre-training stages, suffered from weak alignment between molecular and textual spaces, or imposed stringent demands on the scale of LLMs. To address these challenges, we propose In-Context Molecule Adaptation (ICMA), a new paradigm that allows LLMs to learn molecule-text alignment from context examples via In-Context Molecule Tuning. Specifically, ICMA consists of three stages: Hybrid Context Retrieval, Post-retrieval Re-ranking, and In-Context Molecule Tuning. First, Hybrid Context Retrieval uses BM25 Caption Retrieval and Molecule Graph Retrieval to retrieve similar, informative context examples. Next, Post-retrieval Re-ranking applies Sequence Reversal and Random Walk selection to further improve the quality of the retrieval results. Finally, In-Context Molecule Tuning unlocks the in-context learning and reasoning capability of LLMs on the retrieved examples and updates the parameters of LLMs for better alignment between molecules and texts. Experimental results demonstrate that ICMA enables LLMs to achieve state-of-the-art or comparable performance without extra training corpora or intricate structures, showing that LLMs are inherently in-context molecule learners.
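To make the retrieval-then-prompting idea concrete, below is a minimal sketch (not the authors' implementation) of how the BM25 Caption Retrieval branch and the in-context prompt assembly described in the abstract could look. The example pool, the helper names, the prompt template, and the use of the `rank_bm25` package are all assumptions for illustration; the actual ICMA pipeline additionally includes Molecule Graph Retrieval, Post-retrieval Re-ranking (Sequence Reversal and Random Walk selection), and parameter tuning of the LLM, which are omitted here.

```python
# Minimal sketch of BM25-based context retrieval and prompt assembly.
# Assumes: pip install rank-bm25
from rank_bm25 import BM25Okapi

# Toy pool of (SMILES, caption) pairs standing in for the training corpus.
example_pool = [
    ("CCO", "Ethanol is a simple primary alcohol."),
    ("CC(=O)O", "Acetic acid is a carboxylic acid found in vinegar."),
    ("c1ccccc1", "Benzene is an aromatic hydrocarbon."),
]

# BM25 Caption Retrieval: index the captions and score them against a query caption.
tokenized_captions = [caption.lower().split() for _, caption in example_pool]
bm25 = BM25Okapi(tokenized_captions)

def retrieve_context(query_caption: str, k: int = 2):
    """Return the top-k (SMILES, caption) examples ranked by BM25 caption similarity."""
    scores = bm25.get_scores(query_caption.lower().split())
    ranked = sorted(range(len(example_pool)), key=lambda i: scores[i], reverse=True)
    return [example_pool[i] for i in ranked[:k]]

def build_prompt(query_caption: str) -> str:
    """Assemble an in-context prompt: retrieved examples first, then the query."""
    blocks = [
        f"Caption: {caption}\nMolecule: {smiles}"
        for smiles, caption in retrieve_context(query_caption)
    ]
    blocks.append(f"Caption: {query_caption}\nMolecule:")
    return "\n\n".join(blocks)

# The assembled prompt would then be fed to the LLM being adapted.
print(build_prompt("An aromatic ring compound used as an industrial solvent."))
```

In this sketch the retrieved examples are simply concatenated before the query; in ICMA the same kind of prompt is used both for in-context inference and as the training input during In-Context Molecule Tuning.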
Journal Introduction:
The IEEE Transactions on Knowledge and Data Engineering encompasses knowledge and data engineering aspects within computer science, artificial intelligence, electrical engineering, computer engineering, and related fields. It provides an interdisciplinary platform for disseminating new developments in knowledge and data engineering and explores the practicality of these concepts in both hardware and software. Specific areas covered include knowledge-based and expert systems, AI techniques for knowledge and data management, tools, and methodologies, distributed processing, real-time systems, architectures, data management practices, database design, query languages, security, fault tolerance, statistical databases, algorithms, performance evaluation, and applications.