Accuracy of a Proprietary Large Language Model in Labeling Obstetric Incident Reports

Impact factor: 2.3; JCR quartile: Q2 (Health Care Sciences & Services)

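Joint Commission Journal on Quality and Patient Safety, volume 50, issue 12, pages 877-881. Publication date: 2024-08-06. DOI: 10.1016/j.jcjq.2024.08.001. Full text: https://www.sciencedirect.com/science/article/pii/S1553725024002332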

Jeanene Johnson, MPH, BSN (Quality Improvement Advisor, Quality Improvement Department, Stanford Medicine Children's Health, Palo Alto, California); Conner Brown, BS (Data Scientist, Stanford Medicine Children's Health); Grace Lee, MD, MPH (Professor, Department of Pediatrics, Stanford University School of Medicine, and Chief Quality Officer, Stanford Medicine Children's Health); Keith Morse, MD, MBA (Clinical Associate Professor, Department of Pediatrics, Stanford University School of Medicine, and Medical Director of Clinical Informatics, Stanford Medicine Children's Health)





Accuracy of a Proprietary Large Language Model in Labeling Obstetric Incident Reports

Background

Using the data collected through incident reporting systems is challenging, as it is a large volume of primarily qualitative information. Large language models (LLMs), such as ChatGPT, provide novel capabilities in text summarization and labeling that could support safety data trending and early identification of opportunities to prevent patient harm. This study assessed the capability of a proprietary LLM (GPT-3.5) to automatically label a cross-sectional sample of real-world obstetric incident reports.

Methods

A sample of 370 incident reports submitted to inpatient obstetric units between December 2022 and May 2023 was extracted. Human-annotated labels were assigned by a clinician reviewer and considered gold standard. The LLM was prompted to label incident reports relying solely on its pretrained knowledge and information included in the prompt. Primary outcomes assessed were sensitivity, specificity, positive predictive value, and negative predictive value. A secondary outcome assessed the human-perceived quality of the model's justification for the label(s) applied.
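
The four primary outcomes follow the standard confusion-matrix definitions. The following is a minimal Python sketch, assuming each label decision is scored as one binary comparison against the reviewer's gold-standard label. The names `LabelMetrics` and `label_metrics` and the example counts are hypothetical and are not taken from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabelMetrics:
    sensitivity: float
    specificity: float
    ppv: float
    npv: float


def label_metrics(tp: int, fp: int, fn: int, tn: int) -> LabelMetrics:
    """Compute the four outcomes from confusion-matrix counts.

    tp: model and reviewer both applied the label
    fp: model applied a label the reviewer did not
    fn: reviewer applied a label the model missed
    tn: neither applied the label
    Assumes every denominator is nonzero.
    """
    return LabelMetrics(
        sensitivity=tp / (tp + fn),  # share of reviewer labels the model found
        specificity=tn / (tn + fp),  # share of non-labels the model left alone
        ppv=tp / (tp + fp),          # share of model labels the reviewer agreed with
        npv=tn / (tn + fn),          # share of model non-labels that were correct
    )


# Hypothetical counts for illustration only (not study data).
example = label_metrics(tp=8, fp=2, fn=2, tn=88)
print(example)
```

Sensitivity and specificity describe the model relative to the reviewer's decisions, while the predictive values describe how far a given model output can be trusted, which is why the two pairs can diverge when labels are rare.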

Results

The LLM demonstrated the ability to label incident reports with high sensitivity and specificity. The model applied a total of 79 labels compared to the reviewer's 49 labels. Overall sensitivity for the model was 85.7%, and specificity was 97.9%. Positive and negative predictive values were 53.2% and 99.6%, respectively. For 60.8% of labels, the reviewer approved of the model's justification for applying the label.
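
The reported counts and percentages can be cross-checked with a little arithmetic. The sketch below assumes that sensitivity is computed over the reviewer's 49 labels and positive predictive value over the model's 79 labels. The true-positive count is not stated in the abstract and is inferred here, so it should be read as a back-calculation rather than a reported figure.

```python
REVIEWER_LABELS = 49         # labels applied by the clinician reviewer (reported)
MODEL_LABELS = 79            # labels applied by the model (reported)
REPORTED_SENSITIVITY = 85.7  # percent (reported)
REPORTED_PPV = 53.2          # percent (reported)

# Find every true-positive count consistent with both reported percentages
# after rounding to one decimal place. This is an inference, not study data.
consistent_tp = [
    tp
    for tp in range(min(REVIEWER_LABELS, MODEL_LABELS) + 1)
    if round(100 * tp / REVIEWER_LABELS, 1) == REPORTED_SENSITIVITY
    and round(100 * tp / MODEL_LABELS, 1) == REPORTED_PPV
]
print(consistent_tp)

for tp in consistent_tp:
    false_negatives = REVIEWER_LABELS - tp  # reviewer labels the model missed
    false_positives = MODEL_LABELS - tp     # model labels the reviewer did not apply
    print(tp, false_negatives, false_positives)
```

Under these assumptions the only consistent value is 42 agreed labels, which would imply 7 reviewer labels missed by the model and 37 model labels that the reviewer did not apply. The number of true negatives is not given in the abstract, so specificity and negative predictive value cannot be reproduced the same way.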

Conclusion

The proprietary LLM demonstrated the ability to label obstetric incident reports with high sensitivity and specificity. LLMs offer the potential to enable more efficient use of data from incident reporting systems.


Source journal