Natural vs programming language in LLM knowledge graph construction

IF 7.4 · Region 1 (Management Science) · Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS
Paolo Gajo, Alberto Barrón-Cedeño
{"title":"Natural vs programming language in LLM knowledge graph construction","authors":"Paolo Gajo,&nbsp;Alberto Barrón-Cedeño","doi":"10.1016/j.ipm.2025.104195","DOIUrl":null,"url":null,"abstract":"<div><div>Research on knowledge graph construction (KGC) has recently shown great promise also thanks to the adoption of large language models (LLM) for the automatic extraction of structured information from raw text. However, most works rely on commercial, closed-source LLMs, hindering reproducibility and accessibility. We explore KGC with smaller, open-weight LLMs and investigate whether they can be used to improve upon the results obtained by systems leveraging bigger, closed-source models. Specifically, we focus on CodeKGC, a prompting framework based on GPT-3.5. We choose a variety of models either pre-trained primarily on natural language or on code and fine-tune them on three datasets used for information extraction. We fine-tune with prompts formatted either in natural language or as Python-like scripts. In addition, we optionally train the models with prompts including chain-of-thought sections. After fine-tuning, the choice of coding vs natural language prompts has a limited impact on performance, while chain-of-thought training mostly leads to a performance decrease. Moreover, we show that a LLM can be outperformed by much smaller versions on this task, after undergoing the same amount of training. We find that in general the selected lightweight LLMs outperform the much larger CodeKGC by as much as 15–20 absolute F<span><math><msub><mrow></mrow><mrow><mn>1</mn></mrow></msub></math></span> points after fine-tuning. The results show that state-of-the-art KGC systems can be developed using smaller and open-weight models, enhancing research transparency, lowering compute requirements, and decreasing third-party API reliance.</div><div>Code:</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"62 5","pages":"Article 104195"},"PeriodicalIF":7.4000,"publicationDate":"2025-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457325001360","RegionNum":1,"RegionCategory":"管理学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0

Abstract

Research on knowledge graph construction (KGC) has recently shown great promise, thanks in part to the adoption of large language models (LLMs) for the automatic extraction of structured information from raw text. However, most works rely on commercial, closed-source LLMs, hindering reproducibility and accessibility. We explore KGC with smaller, open-weight LLMs and investigate whether they can be used to improve upon the results obtained by systems leveraging bigger, closed-source models. Specifically, we focus on CodeKGC, a prompting framework based on GPT-3.5. We choose a variety of models pre-trained primarily either on natural language or on code and fine-tune them on three datasets used for information extraction. We fine-tune with prompts formatted either in natural language or as Python-like scripts. In addition, we optionally train the models with prompts including chain-of-thought sections. After fine-tuning, the choice of code vs natural-language prompts has a limited impact on performance, while chain-of-thought training mostly leads to a performance decrease. Moreover, we show that an LLM can be outperformed by much smaller versions on this task after undergoing the same amount of training. We find that, in general, the selected lightweight LLMs outperform the much larger CodeKGC by as much as 15–20 absolute F1 points after fine-tuning. The results show that state-of-the-art KGC systems can be developed using smaller, open-weight models, enhancing research transparency, lowering compute requirements, and decreasing reliance on third-party APIs.
Code:
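To make the abstract's distinction between "natural language" and "Python-like script" prompts concrete, the sketch below illustrates, at a high level, how the same relational triples might be expressed in the two target formats of a CodeKGC-style setup. The class names, relation names, and example triples are hypothetical illustrations, not the paper's actual prompt templates or schema.

```python
# Minimal sketch of the two output formats compared in the paper.
# All names and triples here are illustrative assumptions, not the
# paper's actual CodeKGC prompt schema.

from dataclasses import dataclass


@dataclass
class Entity:
    name: str


@dataclass
class Triple:
    head: Entity
    relation: str
    tail: Entity


# Code-style target: the model emits triples as Python-like object constructions.
code_style_output = [
    Triple(Entity("Marie Curie"), "worked_at", Entity("University of Paris")),
    Triple(Entity("Marie Curie"), "field_of_work", Entity("physics")),
]

# Natural-language target: the same triples expressed as plain text.
natural_language_output = (
    "(Marie Curie, worked_at, University of Paris); "
    "(Marie Curie, field_of_work, physics)"
)
```

In both cases the underlying extraction task is identical; only the surface form the model is fine-tuned to produce differs, which is the variable the paper reports as having limited impact after fine-tuning.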
Source journal
Information Processing & Management (Engineering & Technology – Computer Science: Information Systems)
CiteScore: 17.00
Self-citation rate: 11.60%
Articles published: 276
Review time: 39 days
Journal description: Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Its scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology marketing, and social computing. The journal caters to both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field, with particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research.