Enhancing knowledge graph interactions: A comprehensive Text-to-Cypher pipeline with large language models

IF 7.4 · Zone 1, Management Science · JCR Q1, Computer Science, Information Systems

Information Processing & Management · Pub Date: 2025-07-21 · DOI: 10.1016/j.ipm.2025.104280

Chao Yang, Changyi Li, Xiaodu Hu, Hao Yu, Jinzhi Lu


Citations: 0

Abstract

Knowledge Graphs (KGs) store structured information but typically require specialized query languages, such as Cypher for Neo4j, which creates accessibility challenges for users unfamiliar with graph syntax. Large Language Models (LLMs) offer a solution by translating natural language into Cypher queries. However, existing models, including large-scale LLMs (e.g., ChatGPT) and smaller open-source models (e.g., Llama-7B, 8B), often struggle to generate domain-specific queries accurately because of inadequate alignment with KG schemas and limited domain-specific training data. To address these limitations, we propose a training pipeline tailored specifically for domain-aligned Cypher query generation, emphasizing usability for smaller-scale models. Our method integrates template-based synthetic data generation to produce diverse, high-quality training samples. We combine supervised fine-tuning with preference learning to enhance domain knowledge and Cypher syntax understanding. Additionally, our approach includes a context-aware retrieval mechanism that dynamically incorporates relevant schema elements at inference, improving alignment with domain-specific knowledge. We evaluated our method on the Hetionet biomedical KG using a benchmark dataset of 240 queries across three complexity levels. Our results show that our context-aware prompting achieves a substantial improvement, increasing component matching accuracy by 23.6% for ChatGPT-4o over the vanilla prompt baseline. When applying our full training pipeline to smaller-scale models, CodeLlama-13B* achieves an execution accuracy of 69.2%, nearly matching ChatGPT-4o’s 72.1%. Importantly, our approach significantly narrows the performance gap, enabling smaller models to effectively manage complex, domain-specific tasks previously dominated by larger models. These findings demonstrate that our method is scalable, computationally efficient, and robust for practical Cypher query generation applications.
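The abstract names two mechanisms, template-based synthetic data generation and context-aware retrieval of schema elements at inference, without giving their details. The following Python sketch is only an illustration of the general idea under stated assumptions: the schema fragment, the templates, the keyword-overlap ranking, and all function names (`generate_samples`, `retrieve_schema`, `build_prompt`) are hypothetical and are not taken from the paper.

```python
# Hypothetical, simplified schema fragment (illustrative only; not the paper's schema).
SCHEMA = [
    ("Compound", "TREATS", "Disease"),
    ("Disease", "ASSOCIATES", "Gene"),
    ("Compound", "BINDS", "Gene"),
]

# Hypothetical question/Cypher template pair sharing the same slots.
TEMPLATE_Q = "Which {dst} nodes are linked to the {src} named {name} via {rel}?"
TEMPLATE_C = "MATCH (a:{src} {{name: $name}})-[:{rel}]->(b:{dst}) RETURN b.name"


def generate_samples(schema, names):
    """Fill both templates for every schema triple and example entity name."""
    samples = []
    for src, rel, dst in schema:
        for name in names.get(src, []):
            samples.append({
                "question": TEMPLATE_Q.format(src=src, rel=rel, dst=dst, name=name),
                "cypher": TEMPLATE_C.format(src=src, rel=rel, dst=dst),
                "params": {"name": name},  # passed separately as a query parameter
            })
    return samples


def retrieve_schema(question, schema, top_k=2):
    """Rank triples by how many of their terms occur in the question.

    Assumption: plain substring overlap stands in for whatever retrieval
    method the paper actually uses.
    """
    q = question.lower()
    scored = [(sum(term.lower() in q for term in triple), triple) for triple in schema]
    scored.sort(key=lambda pair: -pair[0])  # stable sort keeps schema order on ties
    return [triple for score, triple in scored[:top_k] if score > 0]


def build_prompt(question, schema, top_k=2):
    """Prepend only the retrieved schema patterns to the question."""
    patterns = [f"(:{s})-[:{r}]->(:{d})" for s, r, d in retrieve_schema(question, schema, top_k)]
    return "Schema:\n" + "\n".join(patterns) + f"\nQuestion: {question}\nCypher:"


if __name__ == "__main__":
    for sample in generate_samples(SCHEMA, {"Compound": ["Aspirin"]}):
        print(sample["question"], "->", sample["cypher"])
    print(build_prompt("Which genes does the compound aspirin bind?", SCHEMA))
```

In a pipeline of the kind the abstract describes, pairs like these would serve as supervised fine-tuning data, and the retrieved patterns would keep the inference prompt focused on the part of the schema relevant to the question.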


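Source journal

Information Processing & Management (Engineering & Technology, Computer Science: Information Systems)

CiteScore: 17.00

Self-citation rate: 11.60%

Articles published: 276

Review time: 39 days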

About the journal: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.