{"title":"Harnessing the Power of Semi-Structured Knowledge and LLMs with Triplet-Based Prefiltering for Question Answering","authors":"Derian Boer, Fabian Koch, Stefan Kramer","doi":"arxiv-2409.00861","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) frequently lack domain-specific knowledge and\neven fine-tuned models tend to hallucinate. Hence, more reliable models that\ncan include external knowledge are needed. We present a pipeline, 4StepFocus,\nand specifically a preprocessing step, that can substantially improve the\nanswers of LLMs. This is achieved by providing guided access to external\nknowledge making use of the model's ability to capture relational context and\nconduct rudimentary reasoning by themselves. The method narrows down\npotentially correct answers by triplets-based searches in a semi-structured\nknowledge base in a direct, traceable fashion, before switching to latent\nrepresentations for ranking those candidates based on unstructured data. This\ndistinguishes it from related methods that are purely based on latent\nrepresentations. 4StepFocus consists of the steps: 1) Triplet generation for\nextraction of relational data by an LLM, 2) substitution of variables in those\ntriplets to narrow down answer candidates employing a knowledge graph, 3)\nsorting remaining candidates with a vector similarity search involving\nassociated non-structured data, 4) reranking the best candidates by the LLM\nwith background data provided. Experiments on a medical, a product\nrecommendation, and an academic paper search test set demonstrate that this\napproach is indeed a powerful augmentation. It not only adds relevant traceable\nbackground information from information retrieval, but also improves\nperformance considerably in comparison to state-of-the-art methods. This paper\npresents a novel, largely unexplored direction and therefore provides a wide\nrange of future work opportunities. Used source code is available at\nhttps://github.com/kramerlab/4StepFocus.","PeriodicalId":501208,"journal":{"name":"arXiv - CS - Logic in Computer Science","volume":"75 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Logic in Computer Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.00861","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Large Language Models (LLMs) frequently lack domain-specific knowledge, and
even fine-tuned models tend to hallucinate. Hence, more reliable models that
can incorporate external knowledge are needed. We present a pipeline,
4StepFocus, and specifically a preprocessing step, which can substantially
improve the answers of LLMs. This is achieved by providing guided access to
external knowledge, making use of the model's ability to capture relational
context and to conduct rudimentary reasoning by itself. The method narrows
down potentially correct answers through triplet-based searches in a semi-structured
knowledge base in a direct, traceable fashion, before switching to latent
representations for ranking those candidates based on unstructured data. This
distinguishes it from related methods that are purely based on latent
representations. 4StepFocus consists of four steps: 1) triplet generation by
an LLM to extract relational data, 2) substitution of variables in those
triplets using a knowledge graph to narrow down the answer candidates, 3)
ranking of the remaining candidates via a vector similarity search over
associated unstructured data, and 4) reranking of the best candidates by the
LLM, with background data provided. Experiments on medical, product
recommendation, and academic paper search test sets demonstrate that this
approach is a powerful augmentation. It not only adds relevant, traceable
background information from information retrieval, but also improves
performance considerably compared with state-of-the-art methods. This paper
presents a novel, largely unexplored direction and therefore opens up a wide
range of opportunities for future work. The source code is available at
https://github.com/kramerlab/4StepFocus.
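
To make the four steps concrete, below is a minimal, self-contained Python sketch of the pipeline. The knowledge graph, documents, example question, and all function names are illustrative assumptions of this sketch, not the authors' implementation: in 4StepFocus proper, steps 1 and 4 call an LLM and step 3 uses vector embeddings, whereas the stubs here are hard-coded or use token overlap so the example runs as-is.

```python
# Hedged sketch of the 4StepFocus pipeline as described in the abstract.
# KG, DOCS, the question, and all helpers are illustrative assumptions.

# Inputs: a tiny triplet knowledge graph and associated unstructured text.
KG = {
    ("aspirin", "treats", "headache"),
    ("ibuprofen", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
}
DOCS = {
    "aspirin": "aspirin is an anti inflammatory drug that interacts with warfarin",
    "ibuprofen": "ibuprofen is a common over the counter pain reliever",
}

def generate_triplets(question):
    """Step 1: an LLM would extract relational triplets with variables
    from the question; hard-coded here for the example below."""
    return [("?x", "treats", "headache"), ("?x", "interacts_with", "warfarin")]

def substitute_variables(patterns):
    """Step 2: bind each variable against the knowledge graph and keep
    only entities that satisfy every triplet pattern."""
    candidates = None
    for s, r, o in patterns:
        if s.startswith("?"):  # variable in subject position
            matches = {subj for (subj, rel, obj) in KG if rel == r and obj == o}
        else:                  # variable in object position
            matches = {obj for (subj, rel, obj) in KG if subj == s and rel == r}
        candidates = matches if candidates is None else candidates & matches
    return candidates or set()

def rank_by_similarity(question, candidates):
    """Step 3: the paper ranks candidates by vector similarity over
    associated unstructured data; token-overlap (Jaccard) similarity
    stands in for embeddings so this sketch runs without dependencies."""
    q = set(question.lower().split())
    def score(entity):
        d = set(DOCS.get(entity, "").split())
        return len(q & d) / len(q | d) if q | d else 0.0
    return sorted(candidates, key=score, reverse=True)

def rerank_with_llm(question, ranked, top_k=3):
    """Step 4: an LLM would rerank the top candidates given their
    background documents; stubbed to return the current best candidate."""
    shortlist = [(c, DOCS.get(c, "")) for c in ranked[:top_k]]
    return shortlist[0][0] if shortlist else None

question = "which headache treatment interacts with warfarin"
patterns = generate_triplets(question)              # step 1
candidates = substitute_variables(patterns)         # step 2
ranked = rank_by_similarity(question, candidates)   # step 3
print(rerank_with_llm(question, ranked))            # step 4
```

Running the sketch prints "aspirin", the only entity satisfying both triplet patterns, which illustrates how the knowledge-graph prefilter in step 2 prunes the candidate set in a traceable way before any latent-representation ranking takes place.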