A comparative analysis of machine learning models and human expertise for nursing intervention classification

Jerome Niyirora, Lynne Longtin, Cynthia Grabski, David Patrishkoff, Andriana Semko

JAMIA Open, 8(3): ooaf057 (2025). DOI: 10.1093/jamiaopen/ooaf057
Abstract
Objective: This study compares the performance of machine learning (ML) models and human experts in mapping unstructured nursing notes to the standardized Nursing Interventions Classification (NIC) system. The aim is to advance automated nursing documentation classification, facilitating cross-facility benchmarking of patient care and organizational outcomes.
Materials and methods: We developed and compared 4 ML models: TF-IDF text vectorization, UMLS semantic mapping, fine-tuned GPT-4o mini, and Bio-Clinical BERT. These models were evaluated against classifications provided by 2 expert nurses, using a dataset of de-identified home healthcare nursing notes obtained from a Florida, USA-based medical clearinghouse. Model performance was assessed using agreement statistics, precision, recall, F1 scores, and Cohen's kappa.
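The abstract does not include implementation code. As a rough illustration of the TF-IDF approach only, the sketch below assumes a scikit-learn pipeline with a linear classifier; the notes, NIC labels, and feature settings are entirely hypothetical and are not taken from the study.

```python
# Minimal sketch of a TF-IDF + linear-classifier baseline for mapping
# nursing notes to NIC categories. Toy data; the authors' actual
# pipeline, features, and classifier are not described in the abstract.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

notes = [
    "Administered insulin per sliding scale; reviewed dosing schedule.",
    "Educated caregiver on wound dressing changes and signs of infection.",
]
nic_labels = ["Drug Management", "Teaching: Procedure/Treatment"]

model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2), min_df=1, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(notes, nic_labels)

print(model.predict(["Reviewed medication list and adjusted warfarin dose."]))
```

A linear model over n-gram features is a common baseline for multi-class text classification; its reliance on surface vocabulary is consistent with the pattern reported below, where ML models do best on lexically distinctive categories such as drug management.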
Results: Human raters achieved the highest agreement with the consensus labels, scoring 0.75 and 0.62, with corresponding F1 scores of 0.61 and 0.45, respectively. In comparison, the ML models performed worse, with the fine-tuned GPT-4o mini the best among them (agreement: 0.50; F1 score: 0.31). A distribution analysis of NIC categories revealed that the ML models performed well on prevalent, clearly defined categories, such as drug management, but struggled with minority classes and context-dependent interventions, such as information management.
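As a sketch of how such scores can be computed against consensus labels, the toy example below assumes that "agreement" means simple percent agreement and that F1 is macro-averaged; the abstract does not specify the averaging scheme, and the labels here are invented.

```python
# Toy evaluation of one rater (or model) against consensus NIC labels.
# Assumes agreement = simple accuracy and macro-averaged F1 (not
# confirmed by the abstract); scikit-learn metrics throughout.
from sklearn.metrics import accuracy_score, cohen_kappa_score, f1_score

consensus = ["Drug Management", "Information Management",
             "Drug Management", "Teaching: Procedure/Treatment"]
rater     = ["Drug Management", "Drug Management",
             "Drug Management", "Teaching: Procedure/Treatment"]

print("agreement:", accuracy_score(consensus, rater))
print("kappa:    ", cohen_kappa_score(consensus, rater))
print("macro F1: ", f1_score(consensus, rater, average="macro"))
```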
Discussion: Current ML approaches show promise in supporting clinical classification tasks, but the performance gap in handling complex, context-dependent interventions highlights the need for improved methods that can better capture the nuanced nature of clinical documentation. Future research should focus on developing methods to process clinical terminology and context-specific documentation with greater precision and adaptability.
Conclusion: Current ML models can aid, but not fully replace, human judgment in classifying nuanced nursing interventions.