Bridging the safety-specific language model gap: Domain-adaptive pretraining of transformer-based models across several industrial sectors for occupational safety applications

IF 7.5 · CAS Tier 1 (Computer Science) · JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Abid Ali Khan Danish, Snehamoy Chatterjee
{"title":"Bridging the safety-specific language model gap: Domain-adaptive pretraining of transformer-based models across several industrial sectors for occupational safety applications","authors":"Abid Ali Khan Danish,&nbsp;Snehamoy Chatterjee","doi":"10.1016/j.eswa.2025.130068","DOIUrl":null,"url":null,"abstract":"<div><div>Occupational safety remains a persistent global challenge despite advancements in regulatory frameworks and safety technologies. Unstructured incident narratives, such as accident reports and safety logs, offer valuable context for understanding workplace hazards but are underutilized due to the gap in the safety-specific language models. This study addresses that gap by adapting pretrained transformer-based models (BERT and ALBERT) to the occupational safety domain through Domain-Adaptive Pretraining (DAPT). We construct a large-scale, multi-source corpus comprising over 2.4 million documents spanning several industrial sectors, including mining, construction, transportation, and chemical processing, augmented with safety-related academic abstracts to preserve general linguistic understanding and mitigate catastrophic forgetting. Using this corpus, we develop two domain-adapted models, safetyBERT and safetyALBERT, through continual pretraining on the masked language modeling objective. Intrinsic evaluation using pseudo-perplexity (PPPL) demonstrates substantial improvements, with safetyBERT and safetyALBERT achieving 76.9% and 90.3% reductions in PPPL, respectively, over their general-domain counterparts. Extrinsic evaluation on the Mine Safety and Health Administration (MSHA) injury dataset across three classification tasks (accident type, mining equipment, and degree of injury) demonstrated consistent performance improvements, with both models outperforming diverse baseline models including general-purpose models (BERT, ALBERT, DistilBERT, RoBERTa), domain-specific scientific model (SciBERT), and large language model (Llama 3.1-8B), with safetyALBERT achieving competitive results despite its parameter-efficient design.. To further assess generalization in low-resource settings, these models were evaluated on the small-scale Alaska insurance claim dataset from mining industry across two classification tasks − claim type and injured body part. Both safetyBERT and safetyALBERT maintained strong performance under this constraint, demonstrating the value of domain adaptation for data-constrained environments. Additionally, multi-task classification on the MSHA dataset using safety domain models showed improved generalization and more balanced performance across underrepresented classes. These findings confirm that DAPT effectively enhances language understanding in safety–critical domains while enabling scalable, resource-efficient deployment. 
This work lays the foundation for integrating domain-adapted natural language processing (NLP) systems into occupational health and safety management frameworks.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"299 ","pages":"Article 130068"},"PeriodicalIF":7.5000,"publicationDate":"2025-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S095741742503684X","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Occupational safety remains a persistent global challenge despite advancements in regulatory frameworks and safety technologies. Unstructured incident narratives, such as accident reports and safety logs, offer valuable context for understanding workplace hazards but are underutilized due to the lack of safety-specific language models. This study addresses that gap by adapting pretrained transformer-based models (BERT and ALBERT) to the occupational safety domain through Domain-Adaptive Pretraining (DAPT). We construct a large-scale, multi-source corpus comprising over 2.4 million documents spanning several industrial sectors, including mining, construction, transportation, and chemical processing, augmented with safety-related academic abstracts to preserve general linguistic understanding and mitigate catastrophic forgetting. Using this corpus, we develop two domain-adapted models, safetyBERT and safetyALBERT, through continual pretraining on the masked language modeling objective. Intrinsic evaluation using pseudo-perplexity (PPPL) demonstrates substantial improvements, with safetyBERT and safetyALBERT achieving 76.9% and 90.3% reductions in PPPL, respectively, over their general-domain counterparts. Extrinsic evaluation on the Mine Safety and Health Administration (MSHA) injury dataset across three classification tasks (accident type, mining equipment, and degree of injury) shows consistent performance improvements, with both models outperforming diverse baselines, including general-purpose models (BERT, ALBERT, DistilBERT, RoBERTa), a domain-specific scientific model (SciBERT), and a large language model (Llama 3.1-8B); safetyALBERT achieves competitive results despite its parameter-efficient design. To further assess generalization in low-resource settings, both models were evaluated on a small-scale Alaska mining-industry insurance claim dataset across two classification tasks: claim type and injured body part. Both safetyBERT and safetyALBERT maintained strong performance under this constraint, demonstrating the value of domain adaptation in data-constrained environments. Additionally, multi-task classification on the MSHA dataset using the safety-domain models showed improved generalization and more balanced performance across underrepresented classes. These findings confirm that DAPT effectively enhances language understanding in safety-critical domains while enabling scalable, resource-efficient deployment. This work lays the foundation for integrating domain-adapted natural language processing (NLP) systems into occupational health and safety management frameworks.
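
For readers who want to see the general recipe in code, the following is a minimal sketch of DAPT-style continual pretraining on the masked language modeling objective using the Hugging Face Transformers library. The corpus file, checkpoint name, output directory, and hyperparameters are illustrative assumptions, not the authors' actual pipeline or settings.

```python
# Minimal sketch of Domain-Adaptive Pretraining (DAPT): continue masked
# language modeling on a domain corpus starting from a general checkpoint.
# "safety_corpus.txt" (one narrative per line) and all hyperparameters are
# hypothetical placeholders.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

corpus = load_dataset("text", data_files={"train": "safety_corpus.txt"})
tokenized = corpus.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True, remove_columns=["text"])

# Dynamic masking: 15% of tokens are masked anew in every batch.
collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)

args = TrainingArguments(output_dir="safetyBERT", num_train_epochs=3,
                         per_device_train_batch_size=32, learning_rate=5e-5)
Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```

The same script applies to ALBERT by swapping the checkpoint name, since both models expose their masked language modeling head through the same interface.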
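Pseudo-perplexity differs from the perplexity of autoregressive models because BERT-style encoders are bidirectional: each token is masked in turn, scored against the full surrounding context, and the scores are aggregated. Below is a minimal sketch of this standard formulation (Salazar et al., 2020); the checkpoint and example sentence are illustrative, not taken from the paper.

```python
# Minimal sketch of pseudo-perplexity (PPPL) for a masked language model:
# mask one token at a time, accumulate its negative log-likelihood, and
# exponentiate the per-token average.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased").eval()

def pseudo_perplexity(sentence: str) -> float:
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    positions = range(1, len(ids) - 1)  # skip [CLS] and [SEP]
    nll = 0.0
    with torch.no_grad():
        for pos in positions:
            masked = ids.clone()
            masked[pos] = tokenizer.mask_token_id
            logits = model(masked.unsqueeze(0)).logits[0, pos]
            nll -= torch.log_softmax(logits, dim=-1)[ids[pos]].item()
    return math.exp(nll / len(positions))

print(pseudo_perplexity("The miner was struck by falling rock."))
```

A lower PPPL on held-out safety narratives indicates that the adapted model assigns higher probability to domain text, which is how the 76.9% and 90.3% reductions above should be read.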
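The multi-task setup described above can be pictured as one shared encoder feeding a separate classification head per MSHA task, trained jointly. The sketch below shows this common pattern; the label-space sizes and task names are assumptions for illustration, not the paper's actual label sets or architecture.

```python
# Minimal sketch of a multi-task classifier over a shared encoder: all three
# heads read the same [CLS] representation. Head sizes are hypothetical.
import torch.nn as nn
from transformers import AutoModel

class MultiTaskSafetyClassifier(nn.Module):
    def __init__(self, encoder_name="bert-base-uncased"):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(encoder_name)
        hidden = self.encoder.config.hidden_size
        self.heads = nn.ModuleDict({
            "accident_type": nn.Linear(hidden, 10),  # hypothetical sizes
            "equipment": nn.Linear(hidden, 15),
            "injury_degree": nn.Linear(hidden, 5),
        })

    def forward(self, input_ids, attention_mask):
        cls = self.encoder(input_ids=input_ids,
                           attention_mask=attention_mask).last_hidden_state[:, 0]
        return {task: head(cls) for task, head in self.heads.items()}
```

Training would typically sum a cross-entropy loss per head, so gradients from all tasks update the shared encoder; this parameter sharing is one plausible source of the more balanced performance on underrepresented classes reported above.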
Source Journal

Expert Systems with Applications (Engineering & Technology - Engineering: Electrical & Electronic)
CiteScore: 13.80
Self-citation rate: 10.60%
Annual publication volume: 2045
Average review time: 8.7 months
Journal Description: Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.