Using artificial intelligence (AI) for form and content checks of medical reports: Proofreading by ChatGPT 4.0 in a neurology department

Impact Factor 1.4 · Q4 · Health Policy & Services
Maximilian Habs, Stefan Knecht, Tobias Schmidt-Wilcke
{"title":"Using artificial intelligence (AI) for form and content checks of medical reports: Proofreading by ChatGPT4.0 in a neurology department","authors":"Maximilian Habs ,&nbsp;Stefan Knecht ,&nbsp;Tobias Schmidt-Wilcke","doi":"10.1016/j.zefq.2025.02.007","DOIUrl":null,"url":null,"abstract":"<div><h3>Introduction</h3><div>Medical reports contain critical information and require concise language, yet often display errors despite advances in digital tools. This study compared the effectiveness of ChatGPT 4.0 in reporting orthographic, grammatical, and content errors in German neurology reports to a human expert.</div></div><div><h3>Materials and Methods</h3><div>Ten neurology reports were embedded with ten linguistic errors each, including typographical and grammatical mistakes, and one significant content error. The reports were reviewed by ChatGPT 4.0 using three prompts: (1) check the text for spelling and grammatical errors and report them in a list format without altering the original text, (2) identify spelling and grammatical errors and generate a revised version of the text, ensuring content integrity, (3) evaluate the text for factual inaccuracies, including incorrect information and treatment errors, and report them without modifying the original text. Human control was provided by an experienced medical secretary. Outcome parameters were processing time, percentage of identified errors, and overall error detection rate.</div></div><div><h3>Results</h3><div>Artificial intelligence (AI) accuracy in error detection was 35% (median) for Prompt 1 and 75% for Prompt 2. The mean word count of erroneous medical reports was 980 (SD = 180). AI-driven report generation was significantly faster than human review (AI Prompt 1: 102.4 s; AI Prompt 2: 209.4 s; Human: 374.0 s; <em>p</em> &lt; 0.0001). Prompt 1, a tabular error report, was faster but less accurate than Prompt 2, a revised version of the report (<em>p</em> = 0.0013). Content analysis by Prompt 3 identified 70% of errors in 34.6 seconds.</div></div><div><h3>Conclusions</h3><div>AI-driven text processing for medical reports is feasible and effective. ChatGPT 4.0 demonstrated strong performance in detecting and reporting errors. The effectiveness of AI depends on prompt design, significantly impacting quality and duration. Integration into medical workflows could enhance accuracy and efficiency. AI holds promise in improving medical report writing. However, proper prompt design seems to be crucial. Appropriately integrated AI can significantly enhance supervision and quality control in health care documentation.</div></div>","PeriodicalId":46628,"journal":{"name":"Zeitschrift fur Evidenz Fortbildung und Qualitaet im Gesundheitswesen","volume":"195 ","pages":"Pages 36-41"},"PeriodicalIF":1.4000,"publicationDate":"2025-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Zeitschrift fur Evidenz Fortbildung und Qualitaet im Gesundheitswesen","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S1865921725000790","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"HEALTH POLICY & SERVICES","Score":null,"Total":0}
Citations: 0

Abstract

Introduction

Medical reports contain critical information and require concise language, yet they often contain errors despite advances in digital tools. This study compared ChatGPT 4.0 with a human expert in detecting and reporting orthographic, grammatical, and content errors in German neurology reports.

Materials and Methods

Ten German neurology reports were each embedded with ten linguistic errors, including typographical and grammatical mistakes, and one significant content error. ChatGPT 4.0 reviewed the reports using three prompts: (1) check the text for spelling and grammatical errors and report them in list form without altering the original text; (2) identify spelling and grammatical errors and generate a revised version of the text while preserving content integrity; (3) evaluate the text for factual inaccuracies, including incorrect information and treatment errors, and report them without modifying the original text. An experienced medical secretary served as the human control. Outcome parameters were processing time, the percentage of identified errors, and the overall error detection rate.
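For illustration, a minimal sketch of how such a prompt-based review could be scripted is shown below. This is not the authors' actual setup (the study does not report how the model was accessed); the OpenAI Python client, the model name, and the English prompt wording are assumptions, with the prompts paraphrased from the Methods above.

```python
# Minimal sketch, assuming the OpenAI Python client (>= 1.x) and an
# OPENAI_API_KEY in the environment. Model name and prompt wording are
# assumptions; the study used ChatGPT 4.0 with German-language reports.
from openai import OpenAI

client = OpenAI()

PROMPTS = {
    1: ("Check the following text for spelling and grammatical errors and "
        "report them in list form without altering the original text."),
    2: ("Identify spelling and grammatical errors and generate a revised "
        "version of the text while preserving content integrity."),
    3: ("Evaluate the text for factual inaccuracies, including incorrect "
        "information and treatment errors, and report them without "
        "modifying the original text."),
}

def review_report(report_text: str, prompt_id: int, model: str = "gpt-4") -> str:
    """Send one medical report through one of the three review prompts."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": PROMPTS[prompt_id]},
            {"role": "user", "content": report_text},
        ],
    )
    return response.choices[0].message.content
```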

Results

The median error detection rate of the artificial intelligence (AI) was 35% for Prompt 1 and 75% for Prompt 2. The mean word count of the erroneous medical reports was 980 (SD = 180). AI-driven report checking was significantly faster than human review (AI Prompt 1: 102.4 s; AI Prompt 2: 209.4 s; human: 374.0 s; p < 0.0001). Prompt 1, which produced a tabular error report, was faster but less accurate than Prompt 2, which produced a revised version of the report (p = 0.0013). Content analysis with Prompt 3 identified 70% of content errors in 34.6 seconds.
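The outcome parameters reported above could be reproduced with a simple harness that times each call and scores the model's reply against the list of deliberately embedded errors. The sketch below reuses `review_report` from the earlier example; the substring-matching heuristic is an assumption for illustration, not the scoring method used in the study.

```python
import time

def score_review(report_text: str, embedded_errors: list[str],
                 prompt_id: int) -> tuple[float, float]:
    """Return (processing time in seconds, fraction of embedded errors found).

    Assumption: an error counts as "found" if its erroneous token appears
    verbatim in the model's reply; the study's scoring may have differed.
    Reuses review_report() from the sketch above.
    """
    start = time.perf_counter()
    reply = review_report(report_text, prompt_id)
    elapsed = time.perf_counter() - start
    found = sum(1 for err in embedded_errors if err in reply)
    return elapsed, found / len(embedded_errors)
```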

Conclusions

AI-driven text processing of medical reports is feasible and effective. ChatGPT 4.0 performed well in detecting and reporting errors, but its effectiveness depended strongly on prompt design, which significantly affected both quality and processing time. AI therefore holds promise for improving medical report writing, provided that prompts are carefully designed. Appropriately integrated into medical workflows, AI could enhance accuracy and efficiency and significantly strengthen supervision and quality control in health care documentation.