Bridging the safety-specific language model gap: Domain-adaptive pretraining of transformer-based models across several industrial sectors for occupational safety applications
IF 7.5 | CAS Tier 1, Computer Science | JCR Q1, COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
{"title":"Bridging the safety-specific language model gap: Domain-adaptive pretraining of transformer-based models across several industrial sectors for occupational safety applications","authors":"Abid Ali Khan Danish, Snehamoy Chatterjee","doi":"10.1016/j.eswa.2025.130068","DOIUrl":null,"url":null,"abstract":"<div><div>Occupational safety remains a persistent global challenge despite advancements in regulatory frameworks and safety technologies. Unstructured incident narratives, such as accident reports and safety logs, offer valuable context for understanding workplace hazards but are underutilized due to the gap in the safety-specific language models. This study addresses that gap by adapting pretrained transformer-based models (BERT and ALBERT) to the occupational safety domain through Domain-Adaptive Pretraining (DAPT). We construct a large-scale, multi-source corpus comprising over 2.4 million documents spanning several industrial sectors, including mining, construction, transportation, and chemical processing, augmented with safety-related academic abstracts to preserve general linguistic understanding and mitigate catastrophic forgetting. Using this corpus, we develop two domain-adapted models, safetyBERT and safetyALBERT, through continual pretraining on the masked language modeling objective. Intrinsic evaluation using pseudo-perplexity (PPPL) demonstrates substantial improvements, with safetyBERT and safetyALBERT achieving 76.9% and 90.3% reductions in PPPL, respectively, over their general-domain counterparts. Extrinsic evaluation on the Mine Safety and Health Administration (MSHA) injury dataset across three classification tasks (accident type, mining equipment, and degree of injury) demonstrated consistent performance improvements, with both models outperforming diverse baseline models including general-purpose models (BERT, ALBERT, DistilBERT, RoBERTa), domain-specific scientific model (SciBERT), and large language model (Llama 3.1-8B), with safetyALBERT achieving competitive results despite its parameter-efficient design.. To further assess generalization in low-resource settings, these models were evaluated on the small-scale Alaska insurance claim dataset from mining industry across two classification tasks − claim type and injured body part. Both safetyBERT and safetyALBERT maintained strong performance under this constraint, demonstrating the value of domain adaptation for data-constrained environments. Additionally, multi-task classification on the MSHA dataset using safety domain models showed improved generalization and more balanced performance across underrepresented classes. These findings confirm that DAPT effectively enhances language understanding in safety–critical domains while enabling scalable, resource-efficient deployment. 
This work lays the foundation for integrating domain-adapted natural language processing (NLP) systems into occupational health and safety management frameworks.</div></div>","PeriodicalId":50461,"journal":{"name":"Expert Systems with Applications","volume":"299 ","pages":"Article 130068"},"PeriodicalIF":7.5000,"publicationDate":"2025-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Expert Systems with Applications","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S095741742503684X","RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0
Abstract
Occupational safety remains a persistent global challenge despite advancements in regulatory frameworks and safety technologies. Unstructured incident narratives, such as accident reports and safety logs, offer valuable context for understanding workplace hazards but remain underutilized due to the lack of safety-specific language models. This study addresses that gap by adapting pretrained transformer-based models (BERT and ALBERT) to the occupational safety domain through Domain-Adaptive Pretraining (DAPT). We construct a large-scale, multi-source corpus comprising over 2.4 million documents spanning several industrial sectors, including mining, construction, transportation, and chemical processing, augmented with safety-related academic abstracts to preserve general linguistic understanding and mitigate catastrophic forgetting. Using this corpus, we develop two domain-adapted models, safetyBERT and safetyALBERT, through continual pretraining on the masked language modeling objective. Intrinsic evaluation using pseudo-perplexity (PPPL) demonstrates substantial improvements, with safetyBERT and safetyALBERT achieving 76.9% and 90.3% reductions in PPPL, respectively, over their general-domain counterparts. Extrinsic evaluation on the Mine Safety and Health Administration (MSHA) injury dataset across three classification tasks (accident type, mining equipment, and degree of injury) shows consistent performance improvements, with both models outperforming a diverse set of baselines, including general-purpose models (BERT, ALBERT, DistilBERT, RoBERTa), a domain-specific scientific model (SciBERT), and a large language model (Llama 3.1-8B); safetyALBERT achieves competitive results despite its parameter-efficient design. To further assess generalization in low-resource settings, the models were evaluated on the small-scale Alaska insurance claim dataset from the mining industry across two classification tasks (claim type and injured body part). Both safetyBERT and safetyALBERT maintained strong performance under this constraint, demonstrating the value of domain adaptation for data-constrained environments. Additionally, multi-task classification on the MSHA dataset using the safety-domain models showed improved generalization and more balanced performance across underrepresented classes. These findings confirm that DAPT effectively enhances language understanding in safety-critical domains while enabling scalable, resource-efficient deployment. This work lays the foundation for integrating domain-adapted natural language processing (NLP) systems into occupational health and safety management frameworks.
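For readers unfamiliar with the continual-pretraining step the abstract describes, the following Python sketch shows what DAPT via masked language modeling looks like with the Hugging Face transformers and datasets libraries. It is a minimal illustration, not the authors' released pipeline: the corpus path (safety_corpus.txt), starting checkpoint, and hyperparameters are placeholder assumptions.

```python
# Minimal DAPT sketch: continue masked-language-model pretraining of a
# general-domain checkpoint on a plain-text safety corpus (one narrative
# per line). Paths, checkpoint, and hyperparameters are illustrative only.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "bert-base-uncased"  # placeholder; ALBERT works the same way
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

dataset = load_dataset("text", data_files={"train": "safety_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# The collator applies standard random masking (15% of tokens) at batch time.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="safetyBERT",
        num_train_epochs=1,
        per_device_train_batch_size=16,
    ),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
trainer.save_model("safetyBERT")
```

The same loop applies to ALBERT by swapping the checkpoint name; ALBERT's cross-layer parameter sharing is what makes the resulting safetyALBERT comparatively parameter-efficient.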
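The intrinsic metric the abstract reports, pseudo-perplexity (PPPL), is commonly computed following Salazar et al.'s (2020) masked-LM scoring: mask each token in turn, score the model's probability of the true token, and exponentiate the negative average log-probability. The sketch below is a naive one-forward-pass-per-token version under that assumption; the checkpoint and example sentence are placeholders.

```python
# Naive pseudo-perplexity (PPPL): lower values mean the masked LM is less
# surprised by domain text. One forward pass per masked position.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

model_name = "bert-base-uncased"  # placeholder; swap in a domain-adapted model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)
model.eval()

def pseudo_perplexity(sentence: str) -> float:
    input_ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    log_prob_sum, n_tokens = 0.0, 0
    for i in range(1, input_ids.size(0) - 1):  # skip [CLS] and [SEP]
        masked = input_ids.clone()
        true_id = masked[i].item()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_prob_sum += torch.log_softmax(logits, dim=-1)[true_id].item()
        n_tokens += 1
    return math.exp(-log_prob_sum / n_tokens)

print(pseudo_perplexity("The miner was injured by falling rock in the stope."))
```

Averaging this quantity over a held-out set of incident narratives gives the corpus-level PPPL against which the reported 76.9% and 90.3% reductions would be measured.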
Journal Introduction
Expert Systems With Applications is an international journal dedicated to the exchange of information on expert and intelligent systems used globally in industry, government, and universities. The journal emphasizes original papers covering the design, development, testing, implementation, and management of these systems, offering practical guidelines. It spans various sectors such as finance, engineering, marketing, law, project management, information management, medicine, and more. The journal also welcomes papers on multi-agent systems, knowledge management, neural networks, knowledge discovery, data mining, and other related areas, excluding applications to military/defense systems.