Dynamic taxonomy generation for future skills identification using a named entity recognition and relation extraction pipeline.

IF 4.7 | JCR Q2 | Computer Science, Artificial Intelligence

Frontiers in Artificial Intelligence | Pub Date: 2025-07-02 | eCollection Date: 2025-01-01 | DOI: 10.3389/frai.2025.1579998

Luis Jose Gonzalez-Gomez, Sofia Margarita Hernandez-Munoz, Abiel Borja, Fernando A Arana-Salas, Jose Daniel Azofeifa, Julieta Noguez, Patricia Caratozzolo


Citations: 0

Abstract



Introduction: The labor market is rapidly evolving, leading to a mismatch between existing Knowledge, Skills, and Abilities (KSAs) and future occupational requirements. Reports from organizations like the World Economic Forum and the OECD emphasize the need for dynamic skill identification. This paper introduces a novel system for constructing a dynamic taxonomy using Natural Language Processing (NLP) techniques, specifically Named Entity Recognition (NER) and Relation Extraction (RE), to identify and predict future skills. By leveraging machine learning models, this taxonomy aims to bridge the gap between current skills and future demands, contributing to educational and professional development.

Methods: To achieve this, an NLP-based architecture was developed using a combination of text preprocessing, NER, and RE models. The NER model identifies and categorizes KSAs and occupations from a corpus of labor market reports, while the RE model establishes the relationships between these entities. A custom pipeline was used for PDF text extraction, tokenization, and lemmatization to standardize the data. The models were trained and evaluated using over 1,700 annotated documents, with the training process optimized for both entity recognition and relationship prediction accuracy.
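The preprocessing stage described in the Methods can be illustrated with a minimal sketch. It assumes the PDF text has already been extracted to a plain string, and it replaces the trained lemmatizer a real pipeline would use with a tiny hypothetical lookup table (`LEMMA_TABLE`). The function names and the sample sentence are illustrative assumptions, not the authors' implementation.

```python
import re
from typing import List

# Hypothetical lemma lookup table (an assumption for illustration only);
# a production pipeline would rely on a trained lemmatizer instead.
LEMMA_TABLE = {
    "skills": "skill",
    "analysts": "analyst",
    "requires": "require",
    "required": "require",
    "models": "model",
}

# Words (optionally hyphenated) or numbers with an optional decimal part.
TOKEN_RE = re.compile(r"[A-Za-z]+(?:-[A-Za-z]+)*|\d+(?:\.\d+)?")


def tokenize(text: str) -> List[str]:
    """Split raw text into lowercase word and number tokens."""
    return [m.group(0).lower() for m in TOKEN_RE.finditer(text)]


def lemmatize(tokens: List[str]) -> List[str]:
    """Map each token to its lemma via the lookup table, if present."""
    return [LEMMA_TABLE.get(tok, tok) for tok in tokens]


def preprocess(text: str) -> List[str]:
    """Clean extracted text, then tokenize and lemmatize it."""
    # Rejoin words hyphenated across line breaks, a common PDF artifact.
    text = re.sub(r"-\s*\n\s*", "", text)
    # Collapse remaining runs of whitespace into single spaces.
    text = re.sub(r"\s+", " ", text)
    return lemmatize(tokenize(text))


sample = "Data analysts re-\nquired machine-learning skills."
print(preprocess(sample))
# ['data', 'analyst', 'require', 'machine-learning', 'skill']
```

Standardizing tokens this way means that surface variants such as "skills" and "skill" map to one entry before entity recognition, which is the stated purpose of the lemmatization step.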

Results: The NER and RE models demonstrated promising performance. The NER model achieved a best micro-averaged F1-score of 65.38% in identifying occupation, skill, and knowledge entities. The RE model reached its best micro-F1 score of 82.2% at epoch 1,009 when classifying semantic relationships between these entities. The taxonomy generated from these models identified emerging skills and occupations, offering insights into future workforce requirements. Visualizations of the taxonomy were created using various graph structures, demonstrating its applicability across multiple sectors. The results indicate that this system can dynamically update and adapt to changes in skill demand over time.
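To make the idea of a taxonomy built from extracted relations concrete, the sketch below groups (head, relation, tail) triples into a nested mapping from occupation to relation to entities. The triples, the relation names (`requires_skill`, `requires_knowledge`), and the helper functions are hypothetical; the abstract does not specify the relation types or the graph structures the authors used.

```python
from collections import defaultdict

# Hypothetical triples, in the shape an RE model might emit them.
triples = [
    ("data analyst", "requires_skill", "statistical modeling"),
    ("data analyst", "requires_knowledge", "sql"),
    ("ml engineer", "requires_skill", "statistical modeling"),
    ("ml engineer", "requires_skill", "model deployment"),
]


def build_taxonomy(triples):
    """Group triples into {occupation: {relation: sorted list of entities}}."""
    graph = defaultdict(lambda: defaultdict(set))
    for head, relation, tail in triples:
        graph[head][relation].add(tail)
    return {
        head: {rel: sorted(tails) for rel, tails in rels.items()}
        for head, rels in graph.items()
    }


def occupations_sharing(taxonomy, entity):
    """Return occupations linked to the given entity by any relation."""
    return sorted(
        head
        for head, rels in taxonomy.items()
        if any(entity in tails for tails in rels.values())
    )


taxonomy = build_taxonomy(triples)
print(taxonomy["ml engineer"]["requires_skill"])
# ['model deployment', 'statistical modeling']
print(occupations_sharing(taxonomy, "statistical modeling"))
# ['data analyst', 'ml engineer']

# A "dynamic" update in this sketch is simply a rebuild with new triples.
updated = build_taxonomy(triples + [("data analyst", "requires_skill", "prompt design")])
print(updated["data analyst"]["requires_skill"])
# ['prompt design', 'statistical modeling']
```

Rebuilding from the full triple list keeps the sketch simple; an incremental update would be a natural alternative for a large corpus.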

Discussion: The dynamic taxonomy model not only provides real-time updates on current competencies but also predicts emerging skill trends, offering a valuable tool for workforce planning. The high recall rates in NER suggest strong entity recognition capabilities, though precision improvements are needed to reduce false positives. Limitations include the need for a larger corpus and sector-specific models. Future work will focus on expanding the corpus, improving model accuracy, and incorporating expert feedback to further refine the taxonomy.
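The interplay between high recall and lower precision can be seen in a small worked example of micro-averaged scoring. The per-label counts below are invented for illustration and are not the paper's results; only the standard definitions of micro-averaged precision, recall, and F1 are assumed.

```python
def micro_prf(counts):
    """Compute micro-averaged precision, recall, and F1.

    counts maps each label to a (true_pos, false_pos, false_neg) tuple;
    micro-averaging pools the counts across labels before scoring.
    """
    tp = sum(c[0] for c in counts.values())
    fp = sum(c[1] for c in counts.values())
    fn = sum(c[2] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Invented counts (NOT from the paper): many false positives, few misses.
counts = {
    "OCCUPATION": (40, 30, 10),
    "SKILL": (30, 20, 5),
    "KNOWLEDGE": (10, 10, 5),
}

precision, recall, f1 = micro_prf(counts)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
# precision=0.571 recall=0.800 f1=0.667
```

With these made-up counts, recall is 0.80 but the 60 pooled false positives pull precision down to about 0.57, so F1 lands near 0.67. This is the pattern the Discussion describes: strong recall with a moderate F1 points to precision as the place to improve.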


Source journal