{"title":"Harnessing the Power of Semi-Structured Knowledge and LLMs with Triplet-Based Prefiltering for Question Answering","authors":"Derian Boer, Fabian Koch, Stefan Kramer","doi":"arxiv-2409.00861","DOIUrl":null,"url":null,"abstract":"Large Language Models (LLMs) frequently lack domain-specific knowledge and\neven fine-tuned models tend to hallucinate. Hence, more reliable models that\ncan include external knowledge are needed. We present a pipeline, 4StepFocus,\nand specifically a preprocessing step, that can substantially improve the\nanswers of LLMs. This is achieved by providing guided access to external\nknowledge making use of the model's ability to capture relational context and\nconduct rudimentary reasoning by themselves. The method narrows down\npotentially correct answers by triplets-based searches in a semi-structured\nknowledge base in a direct, traceable fashion, before switching to latent\nrepresentations for ranking those candidates based on unstructured data. This\ndistinguishes it from related methods that are purely based on latent\nrepresentations. 4StepFocus consists of the steps: 1) Triplet generation for\nextraction of relational data by an LLM, 2) substitution of variables in those\ntriplets to narrow down answer candidates employing a knowledge graph, 3)\nsorting remaining candidates with a vector similarity search involving\nassociated non-structured data, 4) reranking the best candidates by the LLM\nwith background data provided. Experiments on a medical, a product\nrecommendation, and an academic paper search test set demonstrate that this\napproach is indeed a powerful augmentation. It not only adds relevant traceable\nbackground information from information retrieval, but also improves\nperformance considerably in comparison to state-of-the-art methods. This paper\npresents a novel, largely unexplored direction and therefore provides a wide\nrange of future work opportunities. Used source code is available at\nhttps://github.com/kramerlab/4StepFocus.","PeriodicalId":501208,"journal":{"name":"arXiv - CS - Logic in Computer Science","volume":"75 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Logic in Computer Science","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.00861","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Large Language Models (LLMs) frequently lack domain-specific knowledge, and
even fine-tuned models tend to hallucinate. Hence, more reliable models that
can incorporate external knowledge are needed. We present a pipeline,
4StepFocus, and specifically a preprocessing step, which can substantially
improve the answers of LLMs. This is achieved by providing guided access to
external knowledge, making use of the model's ability to capture relational
context and to conduct rudimentary reasoning by itself. The method narrows
down potentially correct answers through triplet-based searches in a semi-structured
knowledge base in a direct, traceable fashion, before switching to latent
representations for ranking those candidates based on unstructured data. This
distinguishes it from related methods that are purely based on latent
representations. 4StepFocus consists of four steps: 1) triplet generation by
an LLM to extract relational data, 2) substitution of variables in those
triplets using a knowledge graph to narrow down the answer candidates, 3)
ranking of the remaining candidates via a vector similarity search over
associated unstructured data, and 4) reranking of the best candidates by the
LLM, with background data provided. Experiments on medical, product
recommendation, and academic paper search test sets demonstrate that this
approach is a powerful augmentation. It not only adds relevant, traceable
background information from information retrieval, but also improves
performance considerably compared with state-of-the-art methods. This paper
presents a novel, largely unexplored direction and therefore opens up a wide
range of opportunities for future work. The source code is available at
https://github.com/kramerlab/4StepFocus.
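
To make the four steps concrete, below is a minimal, self-contained Python sketch of the pipeline. The knowledge graph, documents, example question, and all function names are illustrative assumptions of this sketch, not the authors' implementation: in 4StepFocus proper, steps 1 and 4 call an LLM and step 3 uses vector embeddings, whereas the stubs here are hard-coded or use token overlap so the example runs as-is.

```python
# Hedged sketch of the 4StepFocus pipeline as described in the abstract.
# KG, DOCS, the question, and all helpers are illustrative assumptions.

# Inputs: a tiny triplet knowledge graph and associated unstructured text.
KG = {
    ("aspirin", "treats", "headache"),
    ("ibuprofen", "treats", "headache"),
    ("aspirin", "interacts_with", "warfarin"),
}
DOCS = {
    "aspirin": "aspirin is an anti inflammatory drug that interacts with warfarin",
    "ibuprofen": "ibuprofen is a common over the counter pain reliever",
}

def generate_triplets(question):
    """Step 1: an LLM would extract relational triplets with variables
    from the question; hard-coded here for the example below."""
    return [("?x", "treats", "headache"), ("?x", "interacts_with", "warfarin")]

def substitute_variables(patterns):
    """Step 2: bind each variable against the knowledge graph and keep
    only entities that satisfy every triplet pattern."""
    candidates = None
    for s, r, o in patterns:
        if s.startswith("?"):  # variable in subject position
            matches = {subj for (subj, rel, obj) in KG if rel == r and obj == o}
        else:                  # variable in object position
            matches = {obj for (subj, rel, obj) in KG if subj == s and rel == r}
        candidates = matches if candidates is None else candidates & matches
    return candidates or set()

def rank_by_similarity(question, candidates):
    """Step 3: the paper ranks candidates by vector similarity over
    associated unstructured data; token-overlap (Jaccard) similarity
    stands in for embeddings so this sketch runs without dependencies."""
    q = set(question.lower().split())
    def score(entity):
        d = set(DOCS.get(entity, "").split())
        return len(q & d) / len(q | d) if q | d else 0.0
    return sorted(candidates, key=score, reverse=True)

def rerank_with_llm(question, ranked, top_k=3):
    """Step 4: an LLM would rerank the top candidates given their
    background documents; stubbed to return the current best candidate."""
    shortlist = [(c, DOCS.get(c, "")) for c in ranked[:top_k]]
    return shortlist[0][0] if shortlist else None

question = "which headache treatment interacts with warfarin"
patterns = generate_triplets(question)              # step 1
candidates = substitute_variables(patterns)         # step 2
ranked = rank_by_similarity(question, candidates)   # step 3
print(rerank_with_llm(question, ranked))            # step 4
```

Running the sketch prints "aspirin", the only entity satisfying both triplet patterns, which illustrates how the knowledge-graph prefilter in step 2 prunes the candidate set in a traceable way before any latent-representation ranking takes place.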