Does synthetic data augmentation improve the performances of machine learning classifiers for identifying health problems in patient–nurse verbal communications in home healthcare settings?

IF 2.4, Zone 3 (Medicine), JCR Q1 (Nursing)

Journal of Nursing Scholarship Pub Date : 2024-07-03 DOI:10.1111/jnu.13004

Jihye Kim Scroggins PhD, RN, Maxim Topaz PhD, RN, Jiyoun Song PhD, RN, Maryam Zolnoori PhD

Journal of Nursing Scholarship, Volume 57, Issue 1, Pages 47-58. Journal Article. Full text: https://onlinelibrary.wiley.com/doi/10.1111/jnu.13004

Citations: 0

Abstract

Background

Identifying health problems in audio-recorded patient–nurse communication is important to improve outcomes in home healthcare patients who have complex conditions with increased risks of hospital utilization. Training machine learning classifiers for identifying problems requires resource-intensive human annotation.

Objective

To use GPT-4 to generate synthetic patient–nurse communication and automatically annotate it for common health problems encountered in home healthcare settings. We also examined whether augmenting real-world patient–nurse communication with synthetic data improves the performance of machine learning classifiers in identifying health problems.
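
As a rough illustration of the generation step, the sketch below assembles a few-shot prompt that pairs each example utterance with a short reasoning note before its label. The abstract does not give the authors' actual prompt, so the wording, the `build_prompt` helper, its parameters, and the example utterance are all hypothetical. No model is called here; the function only builds the text that would be sent to one.

```python
def build_prompt(examples, problems, n_turns=6):
    """Assemble a few-shot prompt with reasoning notes.

    All instruction wording here is hypothetical, not the study's prompt.
    """
    lines = [
        "You write short patient-nurse conversations from home healthcare visits.",
        "Label each patient utterance with one of: " + ", ".join(problems) + ", or none.",
        "Before giving a label, explain your reasoning step by step.",
        "",
    ]
    # In-context examples: utterance, then reasoning, then label.
    for i, ex in enumerate(examples, start=1):
        lines += [
            f"Example {i}",
            f"Utterance: {ex['utterance']}",
            f"Reasoning: {ex['reasoning']}",
            f"Label: {ex['label']}",
            "",
        ]
    lines.append(
        f"Now write a new conversation of about {n_turns} turns in the same format."
    )
    return "\n".join(lines)


# Hypothetical in-context example (invented for illustration only).
examples = [
    {
        "utterance": "My ankles have been swelling since Monday.",
        "reasoning": "Swelling in the ankles can point to a circulation problem.",
        "label": "circulation",
    }
]
prompt = build_prompt(examples, ["circulation", "skin", "pain"])
print(prompt)
```

Placing the reasoning line before the label mirrors the idea of asking the model to explain itself before it annotates; the problem names are the three examples mentioned in the Methods.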

Design

Secondary data analysis of patient–nurse verbal communication data in home healthcare settings.

Methods

The data were collected from one of the largest home healthcare organizations in the United States. We used 23 audio recordings of patient–nurse communications from 15 patients. The audio recordings were transcribed verbatim and manually annotated for health problems (e.g., circulation, skin, pain) defined in the Omaha System classification scheme. Synthetic patient–nurse communication data were generated using in-context learning prompting, enhanced by chain-of-thought prompting to improve automatic annotation performance. Machine learning classifiers were trained on three datasets: real-world communication, synthetic communication, and real-world communication augmented with synthetic communication.
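
The three-way comparison can be sketched as follows. This is a minimal illustration under stated assumptions, not the study's pipeline: it treats the task as binary (whether an utterance mentions a given problem), swaps in a tiny bag-of-words naive Bayes model for the classifiers the authors used, and runs on invented toy utterances. The names `TinyNaiveBayes`, `build_training_sets`, and `evaluate` are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    return text.lower().split()


class TinyNaiveBayes:
    """Bag-of-words naive Bayes with add-one smoothing (stand-in classifier)."""

    def fit(self, texts, labels):
        self.classes = sorted(set(labels))
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.classes}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for c in self.classes for w in self.word_counts[c]}
        return self

    def predict(self, texts):
        return [self._predict_one(t) for t in texts]

    def _predict_one(self, text):
        total = sum(self.class_counts.values())
        best, best_score = None, -math.inf
        for c in self.classes:
            denom = sum(self.word_counts[c].values()) + len(self.vocab)
            score = math.log(self.class_counts[c] / total)
            for w in tokenize(text):
                if w in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[c][w] + 1) / denom)
            if score > best_score:
                best, best_score = c, score
        return best


def f1_score(y_true, y_pred, positive=1):
    """F1 for the positive class: harmonic mean of precision and recall."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def build_training_sets(real, synthetic):
    """Real only, synthetic only, and real augmented with synthetic."""
    return {
        "real": list(real),
        "synthetic": list(synthetic),
        "augmented": list(real) + list(synthetic),
    }


def evaluate(training_sets, test):
    """Train one model per training set; score each on the same test set."""
    test_texts, test_labels = zip(*test)
    results = {}
    for name, rows in training_sets.items():
        texts, labels = zip(*rows)
        model = TinyNaiveBayes().fit(texts, labels)
        results[name] = f1_score(test_labels, model.predict(test_texts))
    return results


# Toy utterances invented for illustration; 1 = mentions pain, 0 = does not.
real = [("my leg hurts a lot", 1), ("i slept well last night", 0)]
synthetic = [("the pain in my back is sharp", 1), ("i ate breakfast this morning", 0)]
test = [("my back hurts", 1), ("i slept fine", 0)]

scores = evaluate(build_training_sets(real, synthetic), test)
print(scores)
```

Holding the test set fixed across all three runs is what makes the F1 scores comparable; only the training data changes between conditions.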

Results

Average F1 scores improved from 0.62 to 0.63 after the training data were augmented with synthetic communication. The largest increase was observed with the XGBoost classifier, where the F1 score improved from 0.61 to 0.64 (about a 5% relative improvement). When trained solely on real-world communication or solely on synthetic communication, the classifiers showed comparable F1 scores of 0.62 and 0.61, respectively.
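
The "about 5%" figure is a relative change, not a difference in F1 points. A short check using only the numbers reported above (the helper name `relative_improvement` is ours):

```python
def relative_improvement(before, after):
    """Percentage change of `after` relative to `before`."""
    return (after - before) / before * 100


xgb = relative_improvement(0.61, 0.64)  # XGBoost, real vs. augmented
avg = relative_improvement(0.62, 0.63)  # average across classifiers

print(f"XGBoost: {xgb:.1f}%")  # XGBoost: 4.9%
print(f"Average: {avg:.1f}%")  # Average: 1.6%
```

In absolute terms the gains are 0.03 and 0.01 F1 points, respectively.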

Conclusion

Augmenting real-world training data with synthetic data modestly improved machine learning classifiers' ability to identify health problems in home healthcare, and classifiers trained on synthetic data alone performed comparably to those trained on real-world data alone, highlighting the potential of synthetic data in healthcare analytics.

Clinical Relevance

This study demonstrates the clinical relevance of leveraging synthetic patient–nurse communication data to enhance machine learning classifier performance in identifying health problems in home healthcare settings, which will contribute to more accurate and efficient identification of problems in home healthcare patients with complex health conditions.

Source journal

Journal of Nursing Scholarship (Medicine: Nursing)

CiteScore

6.30

Self-citation rate

5.90%

Articles published

Review time

6-12 weeks

About the journal: This widely read and respected journal features peer-reviewed, thought-provoking articles representing research by some of the world’s leading nurse researchers. Reaching health professionals, faculty and students in 103 countries, the Journal of Nursing Scholarship is focused on health of people throughout the world. It is the official journal of Sigma Theta Tau International and it reflects the society’s dedication to providing the tools necessary to improve nursing care around the world.