Occupation classification model based on DistilKoBERT: using the 5th and 6th Korean Working Condition Surveys.

Impact factor: 1.2 (JCR Q4, Public, Environmental & Occupational Health)
Annals of Occupational and Environmental Medicine. Published: 2024-08-06 (eCollection 2024-01-01). DOI: 10.35371/aoem.2024.36.e19
Tae-Yeon Kim, Seong-Uk Baek, Myeong-Hun Lim, Byungyoon Yun, Domyung Paek, Kyung Ehi Zoh, Kanwoo Youn, Yun Keun Lee, Yangho Kim, Jungwon Kim, Eunsuk Choi, Mo-Yeol Kang, YoonHo Cho, Kyung-Eun Lee, Juho Sim, Juyeon Oh, Heejoo Park, Jian Lee, Jong-Uk Won, Yu-Min Lee, Jin-Ha Yoon

Abstract

Background: Accurate occupation classification is essential in various fields, including policy development and epidemiological studies. This study aims to develop an occupation classification model based on DistilKoBERT.

Methods: This study used data from the 5th and 6th Korean Working Conditions Surveys, conducted in 2017 and 2020, respectively. A total of 99,665 survey participants, who were nationally representative of Korean workers, were included. For each participant, we used the natural-language response describing their job responsibilities together with their occupational code from the Korean Standard Classification of Occupations (7th version, 3-digit codes). The dataset was randomly split into training and test datasets at a ratio of 7:3. The DistilKoBERT-based occupation classification model was fine-tuned on the training dataset and evaluated on the test dataset. Accuracy, precision, recall, and F1 score were calculated as evaluation metrics.
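
As an illustration of the 7:3 random split described in the Methods, here is a minimal Python sketch. It is not the authors' code: the function name `split_7_3`, the fixed seed, and the toy records are assumptions made for the example, and the abstract does not say whether the split was seeded or stratified.

```python
import random


def split_7_3(records, seed=0):
    """Randomly split records into (train, test) at a 7:3 ratio.

    The seed is an assumption added for reproducibility; it is not
    taken from the paper.
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    # Integer arithmetic avoids floating-point rounding at the cut point.
    cut = len(shuffled) * 7 // 10
    return shuffled[:cut], shuffled[cut:]


# Hypothetical toy data: (job description, made-up 3-digit code) pairs.
records = [(f"job description {i}", str(100 + i % 5)) for i in range(10)]
train, test = split_7_3(records)
```

A purely random split like this one does not guarantee that rare occupational codes appear in both parts, which is one reason results for sparsely represented codes can be unstable.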

Results: The final model classified the 28,996 survey participants in the test dataset into 142 occupational codes with an accuracy of 84.44%. The precision, recall, and F1 score, each averaged across occupational codes with weights proportional to sample size, were 0.83, 0.84, and 0.83, respectively. The model showed high precision for service and sales workers but low precision for managers. It also showed high precision for occupations that were well represented in the training dataset.
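
To make the sample-size weighting concrete, the sketch below computes weighted precision, recall, and F1 in plain Python. It assumes the weighting is by the number of true instances of each code (often called support), which is a common convention but is not spelled out in the abstract. The function names and the toy labels are hypothetical.

```python
from collections import Counter


def weighted_prf(y_true, y_pred):
    """Return support-weighted (precision, recall, F1).

    Each class's score is weighted by its share of the true labels.
    """
    support = Counter(y_true)      # true instances per class
    predicted = Counter(y_pred)    # predictions per class
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for label, n_true in support.items():
        tp = correct[label]
        p = tp / predicted[label] if predicted[label] else 0.0
        r = tp / n_true
        f = 2 * p * r / (p + r) if p + r else 0.0
        weight = n_true / n
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Made-up labels standing in for 3-digit occupational codes.
y_true = ["521", "521", "521", "521", "120", "120"]
y_pred = ["521", "521", "521", "120", "120", "521"]
w_precision, w_recall, w_f1 = weighted_prf(y_true, y_pred)
acc = accuracy(y_true, y_pred)
```

Under this weighting, weighted recall is algebraically equal to accuracy, which is consistent with the reported recall of 0.84 alongside an accuracy of 84.44%.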

Conclusions: This study developed an occupation classification system based on DistilKoBERT, which demonstrated reasonable performance. Although further efforts are needed to improve classification accuracy, this automated occupation classification model holds promise for advancing epidemiological studies in occupational safety and health.

Source journal: Annals of Occupational and Environmental Medicine (Public, Environmental & Occupational Health)
CiteScore: 1.90
Self-citation rate: 0.00%
Articles published: 25
Review time: 16 weeks
About the journal: Annals of Occupational and Environmental Medicine (AOEM) is an open access journal that considers original contributions relevant to occupational and environmental medicine and related fields, in the form of original articles, review articles, short letters and case reports. AOEM is aimed at clinicians and researchers working in the wide-ranging discipline of occupational and environmental medicine. Topic areas focus on, but are not limited to, interactions between work and health, covering occupational and environmental epidemiology, toxicology, hygiene, diagnosis and treatment of diseases, management, organization and policy. As the official journal of the Korean Society of Occupational and Environmental Medicine (KSOEM), members and authors based in the Republic of Korea are entitled to a discounted article-processing charge when they publish in AOEM.