A retrieval-augmented knowledge mining method with deep thinking LLMs for biomedical research and clinical support.

IF 11.8, Zone 2 (Biology), Q1 MULTIDISCIPLINARY SCIENCES

GigaScience Pub Date : 2025-01-06 DOI:10.1093/gigascience/giaf109

Yichun Feng, Jiawei Wang, Ruikun He, Lu Zhou, Yixue Li


Citations: 0

Abstract



A retrieval-augmented knowledge mining method with deep thinking LLMs for biomedical research and clinical support.

Background: Knowledge graphs and large language models (LLMs) are key tools for biomedical knowledge integration and reasoning, facilitating structured organization of scientific articles and discovery of complex semantic relationships. However, current methods face challenges: knowledge graph construction is limited by complex terminology, data heterogeneity, and rapid knowledge evolution, while LLMs show limitations in retrieval and reasoning, making it difficult to uncover cross-document associations and reasoning pathways.

Results: We propose a pipeline that uses LLMs to construct a Biomedical Stratified Knowledge Graph (BioStrataKG) from large-scale articles and builds the Biomedical Cross-Document Question Answering Dataset (BioCDQA) to evaluate latent knowledge retrieval and multihop reasoning. We then introduce Integrated and Progressive Retrieval-Augmented Reasoning (IP-RAR) to enhance retrieval accuracy and knowledge reasoning. IP-RAR maximizes information recall through integrated reasoning-based retrieval and refines knowledge via progressive reasoning-based generation, using self-reflection to achieve deep thinking and precise contextual understanding. Experiments show that IP-RAR improves document retrieval F1 score by 20% and answer generation accuracy by 25% over existing methods.
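As a reading aid for the retrieval metric mentioned above, the following is a minimal sketch of a set-based document-retrieval F1 for a single question. It assumes the standard definition (harmonic mean of precision and recall over retrieved versus gold document IDs); the abstract does not state how the authors compute or average F1, nor whether the reported 20% gain is absolute or relative. The function name `retrieval_f1` and the document IDs are hypothetical.

```python
def retrieval_f1(retrieved, relevant):
    """Return (precision, recall, F1) for one question, treating both inputs as sets of document IDs."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)  # documents that are both retrieved and relevant
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical IDs, for illustration only: 2 of 4 retrieved documents are relevant,
# and 2 of 3 relevant documents were found.
p, r, f1 = retrieval_f1(["d1", "d2", "d3", "d4"], ["d2", "d4", "d7"])
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Under this definition, a method that raises recall without flooding the result list with irrelevant documents raises F1, which is consistent with the abstract's emphasis on maximizing recall and then refining the retrieved knowledge.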

Conclusions: The IP-RAR helps doctors efficiently integrate treatment evidence to inform the development of personalized medication plans and enables researchers to analyze advancements and research gaps, accelerating the hypothesis generation phase of scientific discovery and decision-making.


Source journal

GigaScience (MULTIDISCIPLINARY SCIENCES)

CiteScore: 15.50
Self-citation rate: 1.10%
Articles published: 119
Review time: 1 week

Journal description: GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.