ForestryBERT: A pre-trained language model with continual learning adapted to changing forestry text

IF 7.6 · CAS Tier 1 (Computer Science) · Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Jingwei Tan, Huaiqing Zhang, Jie Yang, Yang Liu, Dongping Zheng, Xiqin Liu
{"title":"ForestryBERT:一个预先训练的语言模型,具有持续学习能力,适应不断变化的林业文本","authors":"Jingwei Tan ,&nbsp;Huaiqing Zhang ,&nbsp;Jie Yang ,&nbsp;Yang Liu ,&nbsp;Dongping Zheng ,&nbsp;Xiqin Liu","doi":"10.1016/j.knosys.2025.113706","DOIUrl":null,"url":null,"abstract":"<div><div>Efficient utilization and enhancement of the growing volume of forestry-related textual data is crucial for advancing smart forestry. Pre-trained language models (PLMs) have demonstrated strong capabilities in processing large unlabeled text. To adapt a general PLM to a specific domain, existing studies typically employ a single target corpus for one-time pre-training to incorporate domain-specific knowledge. However, this approach fails to align with the dynamic processes of continuous adaptation and knowledge accumulation that are essential in real-world scenarios. Here, this study proposes ForestryBERT, a BERT model that is continually pre-trained on three Chinese forestry corpora comprising 204,636 texts (19.66 million words) using a continual learning method called DAS.<span><span><sup>1</sup></span></span> We evaluate the model on both text classification and extractive question answering tasks using five datasets for each task. Experimental results show that ForestryBERT outperforms five general-domain PLMs and further pre-trained PLMs (without DAS) across eight custom-built forestry datasets. Moreover, PLMs using DAS exhibit a forgetting rate of 0.65, which is 1.41 lower than PLMs without DAS, and demonstrate superior performance on both new and old tasks. These findings indicate that ForestryBERT, based on continual learning, effectively mitigates catastrophic forgetting and facilitates the continuous acquisition of new knowledge. It expands its forestry knowledge by continually absorbing new unlabeled forestry corpora, showcasing its potential for sustainability and scalability. Our study provides a strategy for handling the growing volume of forestry text during PLM construction, a strategy that is also applicable to other domains.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"320 ","pages":"Article 113706"},"PeriodicalIF":7.6000,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ForestryBERT: A pre-trained language model with continual learning adapted to changing forestry text\",\"authors\":\"Jingwei Tan ,&nbsp;Huaiqing Zhang ,&nbsp;Jie Yang ,&nbsp;Yang Liu ,&nbsp;Dongping Zheng ,&nbsp;Xiqin Liu\",\"doi\":\"10.1016/j.knosys.2025.113706\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Efficient utilization and enhancement of the growing volume of forestry-related textual data is crucial for advancing smart forestry. Pre-trained language models (PLMs) have demonstrated strong capabilities in processing large unlabeled text. To adapt a general PLM to a specific domain, existing studies typically employ a single target corpus for one-time pre-training to incorporate domain-specific knowledge. However, this approach fails to align with the dynamic processes of continuous adaptation and knowledge accumulation that are essential in real-world scenarios. 
Here, this study proposes ForestryBERT, a BERT model that is continually pre-trained on three Chinese forestry corpora comprising 204,636 texts (19.66 million words) using a continual learning method called DAS.<span><span><sup>1</sup></span></span> We evaluate the model on both text classification and extractive question answering tasks using five datasets for each task. Experimental results show that ForestryBERT outperforms five general-domain PLMs and further pre-trained PLMs (without DAS) across eight custom-built forestry datasets. Moreover, PLMs using DAS exhibit a forgetting rate of 0.65, which is 1.41 lower than PLMs without DAS, and demonstrate superior performance on both new and old tasks. These findings indicate that ForestryBERT, based on continual learning, effectively mitigates catastrophic forgetting and facilitates the continuous acquisition of new knowledge. It expands its forestry knowledge by continually absorbing new unlabeled forestry corpora, showcasing its potential for sustainability and scalability. Our study provides a strategy for handling the growing volume of forestry text during PLM construction, a strategy that is also applicable to other domains.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":\"320 \",\"pages\":\"Article 113706\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-05-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S095070512500752X\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S095070512500752X","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Efficient utilization and enhancement of the growing volume of forestry-related textual data is crucial for advancing smart forestry. Pre-trained language models (PLMs) have demonstrated strong capabilities in processing large unlabeled text. To adapt a general PLM to a specific domain, existing studies typically employ a single target corpus for one-time pre-training to incorporate domain-specific knowledge. However, this approach fails to align with the dynamic processes of continuous adaptation and knowledge accumulation that are essential in real-world scenarios. Here, this study proposes ForestryBERT, a BERT model that is continually pre-trained on three Chinese forestry corpora comprising 204,636 texts (19.66 million words) using a continual learning method called DAS. We evaluate the model on both text classification and extractive question answering tasks using five datasets for each task. Experimental results show that ForestryBERT outperforms five general-domain PLMs and further pre-trained PLMs (without DAS) across eight custom-built forestry datasets. Moreover, PLMs using DAS exhibit a forgetting rate of 0.65, which is 1.41 lower than PLMs without DAS, and demonstrate superior performance on both new and old tasks. These findings indicate that ForestryBERT, based on continual learning, effectively mitigates catastrophic forgetting and facilitates the continuous acquisition of new knowledge. It expands its forestry knowledge by continually absorbing new unlabeled forestry corpora, showcasing its potential for sustainability and scalability. Our study provides a strategy for handling the growing volume of forestry text during PLM construction, a strategy that is also applicable to other domains.
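To make the idea of continual domain-adaptive pre-training more concrete, the sketch below shows a minimal masked-language-modeling setup on an unlabeled Chinese corpus with the Hugging Face Transformers and Datasets libraries. It is only an illustration under stated assumptions: the checkpoint name, corpus file, and hyperparameters are placeholders, and it does not implement the DAS continual-learning machinery the paper uses to mitigate catastrophic forgetting across corpora.

```python
# Minimal sketch (not the authors' DAS method): domain-adaptive further
# pre-training of a Chinese BERT with masked language modeling.
# The checkpoint name, corpus file, and hyperparameters are placeholders.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

tokenizer = AutoTokenizer.from_pretrained("bert-base-chinese")
model = AutoModelForMaskedLM.from_pretrained("bert-base-chinese")

# Unlabeled forestry corpus, one text per line (hypothetical file name).
corpus = load_dataset("text", data_files={"train": "forestry_corpus_stage1.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard BERT-style dynamic masking: 15% of tokens are masked for prediction.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="forestrybert-stage1",
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=5e-5,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("forestrybert-stage1")  # the next corpus would start from this checkpoint
```

In a naive pipeline each new corpus simply continues training from the previous checkpoint, which is where catastrophic forgetting arises; DAS adds constraints to preserve previously acquired knowledge, and the forgetting-rate comparison in the abstract measures that effect. The paper's exact formula is not reproduced here; a common way to quantify forgetting in continual learning, shown below as a purely illustrative helper, is the average drop from each earlier task's best score to its score after the final training stage.

```python
def forgetting_rate(scores):
    """Average forgetting across old tasks.

    scores[t][k] is the score on task k measured after training stage t
    (defined for k <= t). This is one common definition, not necessarily
    the one used in the paper.
    """
    num_stages = len(scores)
    final = scores[-1]
    drops = []
    for k in range(num_stages - 1):  # the newest task cannot have been forgotten yet
        best_past = max(scores[t][k] for t in range(k, num_stages - 1))
        drops.append(best_past - final[k])
    return sum(drops) / len(drops) if drops else 0.0

# Example with three stages (accuracy in percentage points):
scores = [
    [90.0],
    [88.0, 85.0],
    [87.5, 84.0, 86.0],
]
print(forgetting_rate(scores))  # ((90.0 - 87.5) + (85.0 - 84.0)) / 2 = 1.75
```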
Source Journal
Knowledge-Based Systems
Category: Engineering & Technology, Computer Science: Artificial Intelligence
CiteScore: 14.80
Self-citation rate: 12.50%
Annual publications: 1245
Average review time: 7.8 months
Journal description: Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on knowledge-based and other artificial intelligence techniques-based systems. The journal aims to support human prediction and decision-making through data science and computation techniques, provide a balanced coverage of theory and practical study, and encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.