Using artificial intelligence (AI) for form and content checks of medical reports: Proofreading by ChatGPT 4.0 in a neurology department
Maximilian Habs, Stefan Knecht, Tobias Schmidt-Wilcke
Zeitschrift fur Evidenz Fortbildung und Qualitaet im Gesundheitswesen, published 2025-03-18. DOI: 10.1016/j.zefq.2025.02.007
Abstract
Introduction: Medical reports contain critical information and require concise language, yet errors remain common despite advances in digital tools. This study compared the effectiveness of ChatGPT 4.0 with that of a human expert in detecting orthographic, grammatical, and content errors in German neurology reports.
Materials and methods: Each of ten neurology reports was seeded with ten linguistic errors (typographical and grammatical mistakes) and one significant content error. The reports were reviewed by ChatGPT 4.0 using three prompts: (1) check the text for spelling and grammatical errors and report them in a list format without altering the original text, (2) identify spelling and grammatical errors and generate a revised version of the text while preserving its content, (3) evaluate the text for factual inaccuracies, including incorrect information and treatment errors, and report them without modifying the original text. A human control review was performed by an experienced medical secretary. Outcome parameters were processing time, percentage of identified errors, and overall error detection rate.
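The study used ChatGPT 4.0 interactively; for readers who want to reproduce a comparable workflow programmatically, the following is a minimal sketch against the OpenAI Chat Completions API. The model name, prompt wording, and helper function are illustrative assumptions paraphrased from the three prompts above, not the authors' exact setup.

```python
# Minimal sketch of scripting the three-prompt review via the OpenAI API.
# Assumptions: model name, prompt wording, and file names are illustrative.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Paraphrased from the three prompts described in Materials and methods.
PROMPTS = {
    "list_errors": (
        "Check the following medical report for spelling and grammatical "
        "errors. Report them as a list. Do not alter the original text."
    ),
    "revised_text": (
        "Identify spelling and grammatical errors in the following medical "
        "report and return a revised version, preserving the content."
    ),
    "content_check": (
        "Evaluate the following medical report for factual inaccuracies, "
        "including incorrect information and treatment errors. Report them "
        "without modifying the original text."
    ),
}


def review_report(report_text: str, task: str) -> str:
    """Send one report through one of the three review prompts."""
    response = client.chat.completions.create(
        model="gpt-4o",  # assumed stand-in for "ChatGPT 4.0"
        messages=[
            {"role": "system", "content": PROMPTS[task]},
            {"role": "user", "content": report_text},
        ],
        temperature=0,  # deterministic output is preferable for proofreading
    )
    return response.choices[0].message.content


# Usage example (Prompt 1, tabular error report, for a single report file):
# with open("report_01.txt", encoding="utf-8") as f:
#     findings = review_report(f.read(), "list_errors")
```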
Results: Median AI error detection accuracy was 35% for Prompt 1 and 75% for Prompt 2. The erroneous medical reports had a mean word count of 980 (SD = 180). AI-driven report processing was significantly faster than human review (AI Prompt 1: 102.4 s; AI Prompt 2: 209.4 s; human: 374.0 s; p < 0.0001). Prompt 1, which produced a tabular error report, was faster but less accurate than Prompt 2, which produced a revised version of the report (p = 0.0013). Content analysis with Prompt 3 identified 70% of errors in 34.6 seconds.
Conclusions: AI-driven text processing of medical reports is feasible and effective. ChatGPT 4.0 performed strongly in detecting and reporting errors, but its effectiveness depends on prompt design, which substantially affects both output quality and processing time. Appropriately integrated into medical workflows, AI could improve the accuracy and efficiency of medical report writing and significantly enhance supervision and quality control in health care documentation.