Enhancing knowledge graph interactions: A comprehensive Text-to-Cypher pipeline with large language models

Chao Yang, Changyi Li, Xiaodu Hu, Hao Yu, Jinzhi Lu

Information Processing & Management, Vol. 63, Article 104280. Published 2025-07-21. DOI: 10.1016/j.ipm.2025.104280
Knowledge Graphs (KGs) store structured information but typically require specialized query languages, such as Cypher for Neo4j, creating accessibility challenges for users unfamiliar with graph syntax. Large Language Models (LLMs) offer a solution by translating natural language into Cypher queries. However, existing models, including large-scale LLMs (e.g., ChatGPT) and smaller open-source models (e.g., Llama 7B and 8B), often struggle to generate accurate domain-specific queries due to inadequate alignment with KG schemas and limited domain-specific training data. To address these limitations, we propose a training pipeline tailored specifically for domain-aligned Cypher query generation, with an emphasis on usability for smaller-scale models. Our method uses template-based synthetic data generation to produce diverse, high-quality training samples, and combines supervised fine-tuning with preference learning to strengthen domain knowledge and Cypher syntax understanding. Additionally, our approach includes a context-aware retrieval mechanism that dynamically incorporates relevant schema elements at inference time, improving alignment with domain-specific knowledge. We evaluated our method on the Hetionet biomedical KG using a benchmark dataset of 240 queries across three complexity levels. Our results show that context-aware prompting alone yields a substantial improvement, increasing component-matching accuracy for ChatGPT-4o by 23.6% over the vanilla-prompt baseline. When our full training pipeline is applied to smaller-scale models, CodeLlama-13B* achieves an execution accuracy of 69.2%, nearly matching ChatGPT-4o's 72.1%. Importantly, our approach significantly narrows the performance gap, enabling smaller models to handle complex, domain-specific tasks previously dominated by larger models. These findings demonstrate that our method is scalable, computationally efficient, and robust for practical Cypher query generation applications.
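To make the template-based data generation step concrete, the following Python is a minimal sketch only, assuming a toy schema slice and two invented templates loosely in the spirit of Hetionet; the paper's actual templates and schema coverage are not reproduced here.

```python
# Minimal sketch of template-based synthetic data generation.
# SCHEMA and TEMPLATES are invented for illustration, not the paper's own.
import json
import random

# A toy slice of a biomedical schema, loosely in the spirit of Hetionet.
SCHEMA = {
    "Compound": ["Aspirin", "Metformin"],
    "Disease": ["hypertension", "type 2 diabetes"],
}

# Each template pairs a natural-language question with a Cypher skeleton.
TEMPLATES = [
    ("Which compounds treat {disease}?",
     "MATCH (c:Compound)-[:TREATS]->(d:Disease {{name: '{disease}'}}) "
     "RETURN c.name"),
    ("What diseases does {compound} treat?",
     "MATCH (c:Compound {{name: '{compound}'}})-[:TREATS]->(d:Disease) "
     "RETURN d.name"),
]

def generate_pairs(samples_per_template: int = 2) -> list[dict]:
    """Fill each template with sampled schema entities to build training pairs."""
    pairs = []
    for question_tpl, cypher_tpl in TEMPLATES:
        for _ in range(samples_per_template):
            slots = {
                "disease": random.choice(SCHEMA["Disease"]),
                "compound": random.choice(SCHEMA["Compound"]),
            }
            pairs.append({
                "question": question_tpl.format(**slots),
                "cypher": cypher_tpl.format(**slots),
            })
    return pairs

print(json.dumps(generate_pairs(), indent=2))
```

Pairs like these could then feed the supervised fine-tuning stage, with preference data built by contrasting correct queries against near-miss variants.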
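The context-aware retrieval mechanism can be sketched in the same spirit. The keyword-overlap scoring and prompt wording below are illustrative assumptions, not the authors' retriever or prompt; the point is that only the schema fragments relevant to the question reach the model.

```python
# Minimal sketch of context-aware schema retrieval at inference time.
# Scoring (naive keyword overlap) and prompt text are assumptions.

# Toy schema elements: (Cypher pattern, short textual description).
SCHEMA_ELEMENTS = [
    ("(:Compound)-[:TREATS]->(:Disease)", "compounds treat diseases"),
    ("(:Gene)-[:ASSOCIATES]->(:Disease)", "genes associated with diseases"),
    ("(:Compound)-[:BINDS]->(:Gene)", "compounds bind genes"),
]

def retrieve_schema(question: str, k: int = 2) -> list[str]:
    """Rank schema elements by keyword overlap with the question."""
    q_tokens = set(question.lower().replace("?", "").split())
    scored = sorted(
        ((len(q_tokens & set(desc.split())), pattern)
         for pattern, desc in SCHEMA_ELEMENTS),
        reverse=True,
    )
    return [pattern for score, pattern in scored[:k] if score > 0]

def build_prompt(question: str) -> str:
    """Assemble a Cypher-generation prompt containing only relevant schema."""
    context = "\n".join(retrieve_schema(question))
    return (
        "Translate the question into a Cypher query for Neo4j.\n"
        "Relevant schema:\n" + context + "\n"
        "Question: " + question + "\nCypher:"
    )

print(build_prompt("Which compounds treat hypertension?"))
```

In practice such a retriever would likely use embeddings or a schema index rather than token overlap, but the dynamic narrowing of schema context is what drives the alignment gains the abstract reports.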
About the journal:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.