{"title":"Language model collaboration for relation extraction from classical Chinese historical documents","authors":"Xuemei Tang, Linxu Wang, Jun Wang","doi":"10.1016/j.ipm.2025.104286","DOIUrl":null,"url":null,"abstract":"<div><div>Classical Chinese historical documents are invaluable for Chinese cultural heritage and history research, yet they remain underexplored in natural language processing (NLP) due to limited annotated resources and linguistic evolution spanning thousands of years. To address the challenges of this domain with scarce annotated resources, we develop a relation extraction (RE) corpus that preserves the characteristics of classical Chinese documents. Utilizing this corpus, we explore RE in classical Chinese documents through a collaboration framework that integrates small pre-trained language models (SLMs), such as BERT, with large language models (LLMs) like GPT-3.5. SLMs can quickly adapt to specific tasks given sufficient supervised data but often struggle in few-shot scenarios. Conversely, LLMs leverage broad domain knowledge to handle few-shot challenges but face limitations when processing lengthy input sequences. To combine these complementary strengths, we propose a “train-guide-predict” collaboration framework in which a small language model collaborates with a large language model (SLCoLM). This framework enables SLMs to capture task-specific knowledge for head relation categories, while LLMs offer insights for few-shot relation categories. Experimental results show that SLCoLM outperforms both fine-tuned SLMs and LLMs using in-context learning (ICL). It also helps mitigate the long-tail problem in classical Chinese historical documents.</div></div>","PeriodicalId":50365,"journal":{"name":"Information Processing & Management","volume":"63 1","pages":"Article 104286"},"PeriodicalIF":6.9000,"publicationDate":"2025-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Information Processing & Management","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0306457325002274","RegionNum":1,"RegionCategory":"Management","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Citations: 0
Abstract
Classical Chinese historical documents are invaluable for Chinese cultural heritage and history research, yet they remain underexplored in natural language processing (NLP) due to limited annotated resources and linguistic evolution spanning thousands of years. To address the challenges of this domain with scarce annotated resources, we develop a relation extraction (RE) corpus that preserves the characteristics of classical Chinese documents. Utilizing this corpus, we explore RE in classical Chinese documents through a collaboration framework that integrates small pre-trained language models (SLMs), such as BERT, with large language models (LLMs) like GPT-3.5. SLMs can quickly adapt to specific tasks given sufficient supervised data but often struggle in few-shot scenarios. Conversely, LLMs leverage broad domain knowledge to handle few-shot challenges but face limitations when processing lengthy input sequences. To combine these complementary strengths, we propose a “train-guide-predict” collaboration framework in which a small language model collaborates with a large language model (SLCoLM). This framework enables SLMs to capture task-specific knowledge for head relation categories, while LLMs offer insights for few-shot relation categories. Experimental results show that SLCoLM outperforms both fine-tuned SLMs and LLMs using in-context learning (ICL). It also helps mitigate the long-tail problem in classical Chinese historical documents.
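The division of labor the abstract describes (a fine-tuned SLM for frequent "head" relation categories, an LLM queried via in-context learning for rare "tail" categories) can be sketched as a simple routing scheme. This is a minimal illustrative sketch only, not the authors' actual SLCoLM implementation: the category sets, confidence threshold, and both predictor functions are hypothetical stand-ins.

```python
# Hypothetical sketch of the "train-guide-predict" idea: route head
# relation categories to a fine-tuned SLM and defer tail categories
# (or low-confidence predictions) to an LLM with in-context examples.
# All names and thresholds below are illustrative assumptions.

HEAD_CATEGORIES = {"person-office", "person-place"}  # assumed frequent relations
TAIL_CATEGORIES = {"person-artifact", "event-omen"}  # assumed rare relations

def slm_predict(sentence, entity_pair):
    """Stand-in for a fine-tuned SLM classifier (e.g. BERT).
    Returns (relation_label, confidence)."""
    # A real system would run a trained classifier here.
    return "person-office", 0.92

def llm_icl_predict(sentence, entity_pair, demonstrations):
    """Stand-in for prompting an LLM with a few in-context demonstrations."""
    prompt = "\n".join(demonstrations)
    prompt += f"\nSentence: {sentence}\nEntities: {entity_pair}\nRelation:"
    # A real system would send `prompt` to an LLM API and parse its answer.
    return "event-omen"

def collaborate(sentence, entity_pair, demonstrations, threshold=0.8):
    """Accept the SLM's answer for confident head-category predictions;
    otherwise fall back to the LLM's in-context prediction."""
    label, conf = slm_predict(sentence, entity_pair)
    if label in HEAD_CATEGORIES and conf >= threshold:
        return label, "slm"
    return llm_icl_predict(sentence, entity_pair, demonstrations), "llm"
```

In this sketch the SLM's confident head-category answer is kept as-is, while anything uncertain or rare is handed to the LLM, mirroring how the framework lets each model cover the other's weakness on the long-tail distribution.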
Journal introduction:
Information Processing and Management is dedicated to publishing cutting-edge original research at the convergence of computing and information science. Our scope encompasses theory, methods, and applications across various domains, including advertising, business, health, information science, information technology, marketing, and social computing.
We aim to cater to the interests of both primary researchers and practitioners by offering an effective platform for the timely dissemination of advanced and topical issues in this interdisciplinary field. The journal places particular emphasis on original research articles, research survey articles, research method articles, and articles addressing critical applications of research. Join us in advancing knowledge and innovation at the intersection of computing and information science.