Jingwei Tan, Huaiqing Zhang, Jie Yang, Yang Liu, Dongping Zheng, Xiqin Liu
{"title":"ForestryBERT:一个预先训练的语言模型,具有持续学习能力,适应不断变化的林业文本","authors":"Jingwei Tan , Huaiqing Zhang , Jie Yang , Yang Liu , Dongping Zheng , Xiqin Liu","doi":"10.1016/j.knosys.2025.113706","DOIUrl":null,"url":null,"abstract":"<div><div>Efficient utilization and enhancement of the growing volume of forestry-related textual data is crucial for advancing smart forestry. Pre-trained language models (PLMs) have demonstrated strong capabilities in processing large unlabeled text. To adapt a general PLM to a specific domain, existing studies typically employ a single target corpus for one-time pre-training to incorporate domain-specific knowledge. However, this approach fails to align with the dynamic processes of continuous adaptation and knowledge accumulation that are essential in real-world scenarios. Here, this study proposes ForestryBERT, a BERT model that is continually pre-trained on three Chinese forestry corpora comprising 204,636 texts (19.66 million words) using a continual learning method called DAS.<span><span><sup>1</sup></span></span> We evaluate the model on both text classification and extractive question answering tasks using five datasets for each task. Experimental results show that ForestryBERT outperforms five general-domain PLMs and further pre-trained PLMs (without DAS) across eight custom-built forestry datasets. Moreover, PLMs using DAS exhibit a forgetting rate of 0.65, which is 1.41 lower than PLMs without DAS, and demonstrate superior performance on both new and old tasks. These findings indicate that ForestryBERT, based on continual learning, effectively mitigates catastrophic forgetting and facilitates the continuous acquisition of new knowledge. It expands its forestry knowledge by continually absorbing new unlabeled forestry corpora, showcasing its potential for sustainability and scalability. Our study provides a strategy for handling the growing volume of forestry text during PLM construction, a strategy that is also applicable to other domains.</div></div>","PeriodicalId":49939,"journal":{"name":"Knowledge-Based Systems","volume":"320 ","pages":"Article 113706"},"PeriodicalIF":7.6000,"publicationDate":"2025-05-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"ForestryBERT: A pre-trained language model with continual learning adapted to changing forestry text\",\"authors\":\"Jingwei Tan , Huaiqing Zhang , Jie Yang , Yang Liu , Dongping Zheng , Xiqin Liu\",\"doi\":\"10.1016/j.knosys.2025.113706\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><div>Efficient utilization and enhancement of the growing volume of forestry-related textual data is crucial for advancing smart forestry. Pre-trained language models (PLMs) have demonstrated strong capabilities in processing large unlabeled text. To adapt a general PLM to a specific domain, existing studies typically employ a single target corpus for one-time pre-training to incorporate domain-specific knowledge. However, this approach fails to align with the dynamic processes of continuous adaptation and knowledge accumulation that are essential in real-world scenarios. 
Here, this study proposes ForestryBERT, a BERT model that is continually pre-trained on three Chinese forestry corpora comprising 204,636 texts (19.66 million words) using a continual learning method called DAS.<span><span><sup>1</sup></span></span> We evaluate the model on both text classification and extractive question answering tasks using five datasets for each task. Experimental results show that ForestryBERT outperforms five general-domain PLMs and further pre-trained PLMs (without DAS) across eight custom-built forestry datasets. Moreover, PLMs using DAS exhibit a forgetting rate of 0.65, which is 1.41 lower than PLMs without DAS, and demonstrate superior performance on both new and old tasks. These findings indicate that ForestryBERT, based on continual learning, effectively mitigates catastrophic forgetting and facilitates the continuous acquisition of new knowledge. It expands its forestry knowledge by continually absorbing new unlabeled forestry corpora, showcasing its potential for sustainability and scalability. Our study provides a strategy for handling the growing volume of forestry text during PLM construction, a strategy that is also applicable to other domains.</div></div>\",\"PeriodicalId\":49939,\"journal\":{\"name\":\"Knowledge-Based Systems\",\"volume\":\"320 \",\"pages\":\"Article 113706\"},\"PeriodicalIF\":7.6000,\"publicationDate\":\"2025-05-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Knowledge-Based Systems\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S095070512500752X\",\"RegionNum\":1,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Knowledge-Based Systems","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S095070512500752X","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
ForestryBERT: A pre-trained language model with continual learning adapted to changing forestry text
Efficient utilization and enhancement of the growing volume of forestry-related textual data is crucial for advancing smart forestry. Pre-trained language models (PLMs) have demonstrated strong capabilities in processing large unlabeled text. To adapt a general PLM to a specific domain, existing studies typically employ a single target corpus for one-time pre-training to incorporate domain-specific knowledge. However, this approach fails to align with the dynamic processes of continuous adaptation and knowledge accumulation that are essential in real-world scenarios. This study proposes ForestryBERT, a BERT model that is continually pre-trained on three Chinese forestry corpora comprising 204,636 texts (19.66 million words) using a continual learning method called DAS. We evaluate the model on both text classification and extractive question answering tasks using five datasets for each task. Experimental results show that ForestryBERT outperforms five general-domain PLMs and further pre-trained PLMs (without DAS) across eight custom-built forestry datasets. Moreover, PLMs using DAS exhibit a forgetting rate of 0.65, which is 1.41 lower than PLMs without DAS, and demonstrate superior performance on both new and old tasks. These findings indicate that ForestryBERT, based on continual learning, effectively mitigates catastrophic forgetting and facilitates the continuous acquisition of new knowledge. It expands its forestry knowledge by continually absorbing new unlabeled forestry corpora, showcasing its potential for sustainability and scalability. Our study provides a strategy for handling the growing volume of forestry text during PLM construction, a strategy that is also applicable to other domains.
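The continual pre-training setup described in the abstract can be illustrated with a short sketch. The snippet below is a minimal, hedged example of sequential domain-adaptive pre-training with masked language modelling using Hugging Face transformers and datasets: a general Chinese BERT checkpoint is further pre-trained on a sequence of unlabeled corpora, with each stage starting from the weights produced by the previous stage. It does not implement DAS itself (its importance-based soft-masking and related components are omitted), and the corpus file paths, base checkpoint, and hyperparameters are placeholders rather than the paper's settings.

```python
# Minimal sketch of sequential (continual) domain-adaptive pre-training with
# masked language modelling, assuming Hugging Face `transformers` and `datasets`.
# This is NOT the DAS method from the paper; it only shows the continual setup:
# one general Chinese BERT checkpoint further pre-trained on a sequence of
# unlabeled forestry corpora, saving a checkpoint after each stage.
# Corpus paths and hyperparameters below are hypothetical placeholders.

from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_CHECKPOINT = "bert-base-chinese"   # general-domain starting point
CORPUS_FILES = [                        # hypothetical forestry corpus files
    "forestry_corpus_1.txt",
    "forestry_corpus_2.txt",
    "forestry_corpus_3.txt",
]

tokenizer = AutoTokenizer.from_pretrained(BASE_CHECKPOINT)
model = AutoModelForMaskedLM.from_pretrained(BASE_CHECKPOINT)
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)


def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)


# Continual pre-training: visit the corpora one after another, each stage
# starting from the weights left by the previous stage.
for stage, path in enumerate(CORPUS_FILES, start=1):
    dataset = load_dataset("text", data_files=path, split="train")
    dataset = dataset.map(tokenize, batched=True, remove_columns=["text"])

    args = TrainingArguments(
        output_dir=f"forestrybert_stage{stage}",
        per_device_train_batch_size=16,
        num_train_epochs=1,
        learning_rate=5e-5,
        save_strategy="epoch",
        logging_steps=500,
    )
    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=dataset,
        data_collator=collator,
    )
    trainer.train()
    model = trainer.model                       # carry weights to the next stage
    trainer.save_model(f"forestrybert_stage{stage}")
```

Under such a setup, forgetting on earlier corpora could be estimated by re-evaluating old downstream tasks after later stages complete; the exact metric behind the reported 0.65 figure follows the paper's own definition and is not reproduced here.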
Journal introduction:
Knowledge-Based Systems, an international and interdisciplinary journal in artificial intelligence, publishes original, innovative, and creative research results in the field. It focuses on systems based on knowledge-based and other artificial intelligence techniques. The journal aims to support human prediction and decision-making through data science and computation techniques, to provide balanced coverage of theory and practical study, and to encourage the development and implementation of knowledge-based intelligence models, methods, systems, and software tools. Applications in business, government, education, engineering, and healthcare are emphasized.