Large language models can support generation of standardized discharge summaries – A retrospective study utilizing ChatGPT-4 and electronic health records
IF 3.7 · JCR Q2, Computer Science, Information Systems · CAS Region 2 (Medicine)
Arne Schwieger, Katrin Angst, Mateo de Bardeci, Achim Burrer, Flurin Cathomas, Stefano Ferrea, Franziska Grätz, Marius Knorr, Golo Kronenberg, Tobias Spiller, David Troi, Erich Seifritz, Samantha Weber, Sebastian Olbrich
International Journal of Medical Informatics, Volume 192, Article 105654. Published 2024-10-14. DOI: 10.1016/j.ijmedinf.2024.105654. Available at: https://www.sciencedirect.com/science/article/pii/S1386505624003174
Citations: 0
Abstract
Objective
To evaluate whether psychiatric discharge summaries (DS) generated with ChatGPT-4 from electronic health records (EHR) can match the quality of DS written by psychiatric residents.
Methods
At a psychiatric primary care hospital, we compared 20 inpatient DS written by residents to DS generated with ChatGPT-4 from pseudonymized residents’ notes in the patients’ EHRs and a standardized prompt. Eight blinded psychiatry specialists rated both versions on a custom Likert scale from 1 to 5 across 15 quality subcategories. The primary outcome was the overall rating difference between the two groups. The secondary outcomes were the rating differences at the level of individual questions, cases, and raters.
Results
Human-written DS were rated significantly higher than AI-DS (mean ratings: human 3.78, AI 3.12, p < 0.05). They significantly surpassed AI-DS in 12/15 questions and 16/20 cases and were significantly favored by 7/8 raters. For “low expected correction effort”, human DS were rated 67% favorable, 19% neutral, and 14% unfavorable, whereas AI-DS were rated 22% favorable, 33% neutral, and 45% unfavorable. Hallucinations were present in 40% of AI-DS, with 37.5% of these deemed highly clinically relevant. Minor content mistakes were found in 30% of AI-DS and 10% of human DS. Raters correctly identified AI-DS with 81% sensitivity and 75% specificity.
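The sensitivity and specificity figures above follow the standard confusion-matrix definitions, here framed around detecting AI authorship (AI-written = positive class). The sketch below illustrates the arithmetic; the rater-judgment counts are hypothetical values chosen only to reproduce the reported rates, not the study’s raw data.

```python
def sensitivity(tp: int, fn: int) -> float:
    """True positive rate: share of AI-written summaries correctly flagged as AI."""
    return tp / (tp + fn)

def specificity(tn: int, fp: int) -> float:
    """True negative rate: share of human-written summaries correctly identified."""
    return tn / (tn + fp)

# Hypothetical counts (per 100 judgments of each class) matching the reported rates.
tp, fn = 81, 19   # AI summaries correctly flagged vs. missed
tn, fp = 75, 25   # human summaries correctly identified vs. mistaken for AI

print(f"sensitivity = {sensitivity(tp, fn):.2f}")  # 0.81
print(f"specificity = {specificity(tn, fp):.2f}")  # 0.75
```

Note that 81% sensitivity against a 75% specificity means roughly one in four human-written summaries was misattributed to the model, which is consistent with the abstract’s conclusion that the two versions were often hard to tell apart.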
Discussion
Overall, AI-DS did not match the quality of resident-written DS but performed similarly in 20% of cases and were rated as favorable for “low expected correction effort” in 22% of cases. AI-DS lacked most in content specificity, ability to distill key case information, and coherence but performed adequately in conciseness, adherence to formalities, relevance of included content, and form.
Conclusion
LLM-written DS show potential as templates for physicians to finalize, potentially saving time in the future.
Journal description:
International Journal of Medical Informatics provides an international medium for dissemination of original results and interpretative reviews concerning the field of medical informatics. The Journal emphasizes the evaluation of systems in healthcare settings.
The scope of the journal covers:
Information systems, including national or international registration systems, hospital information systems, departmental and/or physicians' office systems, document handling systems, electronic medical record systems, standardization, systems integration, etc.;
Computer-aided medical decision support systems using heuristic, algorithmic and/or statistical methods as exemplified in decision theory, protocol development, artificial intelligence, etc.;
Educational computer-based programs pertaining to medical informatics or medicine in general;
Organizational, economic, social, clinical, ethical, and cost-benefit aspects of IT applications in health care.