Yichun Feng, Jiawei Wang, Ruikun He, Lu Zhou, Yixue Li
{"title":"面向生物医学研究和临床支持的深度思维法学硕士检索增强知识挖掘方法。","authors":"Yichun Feng, Jiawei Wang, Ruikun He, Lu Zhou, Yixue Li","doi":"10.1093/gigascience/giaf109","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Knowledge graphs and large language models (LLMs) are key tools for biomedical knowledge integration and reasoning, facilitating structured organization of scientific articles and discovery of complex semantic relationships. However, current methods face challenges: knowledge graph construction is limited by complex terminology, data heterogeneity, and rapid knowledge evolution, while LLMs show limitations in retrieval and reasoning, making it difficult to uncover cross-document associations and reasoning pathways.</p><p><strong>Results: </strong>We propose a pipeline that uses LLMs to construct a Biomedical Stratified Knowledge Graph (BioStrataKG) from large-scale articles and builds the Biomedical Cross-Document Question Answering Dataset (BioCDQA) to evaluate latent knowledge retrieval and multihop reasoning. We then introduce Integrated and Progressive Retrieval-Augmented Reasoning (IP-RAR) to enhance retrieval accuracy and knowledge reasoning. IP-RAR maximizes information recall through integrated reasoning-based retrieval and refines knowledge via progressive reasoning-based generation, using self-reflection to achieve deep thinking and precise contextual understanding. Experiments show that IP-RAR improves document retrieval F1 score by 20% and answer generation accuracy by 25% over existing methods.</p><p><strong>Conclusions: </strong>The IP-RAR helps doctors efficiently integrate treatment evidence to inform the development of personalized medication plans and enables researchers to analyze advancements and research gaps, accelerating the hypothesis generation phase of scientific discovery and decision-making.</p>","PeriodicalId":12581,"journal":{"name":"GigaScience","volume":"14 ","pages":""},"PeriodicalIF":11.8000,"publicationDate":"2025-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448786/pdf/","citationCount":"0","resultStr":"{\"title\":\"A retrieval-augmented knowledge mining method with deep thinking LLMs for biomedical research and clinical support.\",\"authors\":\"Yichun Feng, Jiawei Wang, Ruikun He, Lu Zhou, Yixue Li\",\"doi\":\"10.1093/gigascience/giaf109\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Knowledge graphs and large language models (LLMs) are key tools for biomedical knowledge integration and reasoning, facilitating structured organization of scientific articles and discovery of complex semantic relationships. However, current methods face challenges: knowledge graph construction is limited by complex terminology, data heterogeneity, and rapid knowledge evolution, while LLMs show limitations in retrieval and reasoning, making it difficult to uncover cross-document associations and reasoning pathways.</p><p><strong>Results: </strong>We propose a pipeline that uses LLMs to construct a Biomedical Stratified Knowledge Graph (BioStrataKG) from large-scale articles and builds the Biomedical Cross-Document Question Answering Dataset (BioCDQA) to evaluate latent knowledge retrieval and multihop reasoning. We then introduce Integrated and Progressive Retrieval-Augmented Reasoning (IP-RAR) to enhance retrieval accuracy and knowledge reasoning. IP-RAR maximizes information recall through integrated reasoning-based retrieval and refines knowledge via progressive reasoning-based generation, using self-reflection to achieve deep thinking and precise contextual understanding. Experiments show that IP-RAR improves document retrieval F1 score by 20% and answer generation accuracy by 25% over existing methods.</p><p><strong>Conclusions: </strong>The IP-RAR helps doctors efficiently integrate treatment evidence to inform the development of personalized medication plans and enables researchers to analyze advancements and research gaps, accelerating the hypothesis generation phase of scientific discovery and decision-making.</p>\",\"PeriodicalId\":12581,\"journal\":{\"name\":\"GigaScience\",\"volume\":\"14 \",\"pages\":\"\"},\"PeriodicalIF\":11.8000,\"publicationDate\":\"2025-01-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12448786/pdf/\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"GigaScience\",\"FirstCategoryId\":\"99\",\"ListUrlMain\":\"https://doi.org/10.1093/gigascience/giaf109\",\"RegionNum\":2,\"RegionCategory\":\"生物学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"MULTIDISCIPLINARY SCIENCES\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"GigaScience","FirstCategoryId":"99","ListUrlMain":"https://doi.org/10.1093/gigascience/giaf109","RegionNum":2,"RegionCategory":"生物学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"MULTIDISCIPLINARY SCIENCES","Score":null,"Total":0}
A retrieval-augmented knowledge mining method with deep thinking LLMs for biomedical research and clinical support.
Background: Knowledge graphs and large language models (LLMs) are key tools for biomedical knowledge integration and reasoning, facilitating structured organization of scientific articles and discovery of complex semantic relationships. However, current methods face challenges: knowledge graph construction is limited by complex terminology, data heterogeneity, and rapid knowledge evolution, while LLMs show limitations in retrieval and reasoning, making it difficult to uncover cross-document associations and reasoning pathways.
Results: We propose a pipeline that uses LLMs to construct a Biomedical Stratified Knowledge Graph (BioStrataKG) from large-scale articles and builds the Biomedical Cross-Document Question Answering Dataset (BioCDQA) to evaluate latent knowledge retrieval and multihop reasoning. We then introduce Integrated and Progressive Retrieval-Augmented Reasoning (IP-RAR) to enhance retrieval accuracy and knowledge reasoning. IP-RAR maximizes information recall through integrated reasoning-based retrieval and refines knowledge via progressive reasoning-based generation, using self-reflection to achieve deep thinking and precise contextual understanding. Experiments show that IP-RAR improves document retrieval F1 score by 20% and answer generation accuracy by 25% over existing methods.
Conclusions: The IP-RAR helps doctors efficiently integrate treatment evidence to inform the development of personalized medication plans and enables researchers to analyze advancements and research gaps, accelerating the hypothesis generation phase of scientific discovery and decision-making.
期刊介绍:
GigaScience seeks to transform data dissemination and utilization in the life and biomedical sciences. As an online open-access open-data journal, it specializes in publishing "big-data" studies encompassing various fields. Its scope includes not only "omic" type data and the fields of high-throughput biology currently serviced by large public repositories, but also the growing range of more difficult-to-access data, such as imaging, neuroscience, ecology, cohort data, systems biology and other new types of large-scale shareable data.