The potential of Generative Pre-trained Transformer 4 (GPT-4) to analyse medical notes in three different languages: a retrospective model-evaluation study
Maria Clara Saad Menezes MD, Alexander F Hoffmann MBA, Amelia L M Tan PhD, Mariné Nalbandyan MD, Prof Gilbert S Omenn MD, Diego R Mazzotti PhD, Alejandro Hernández-Arango MD, Prof Shyam Visweswaran MD, Shruthi Venkatesh BSc, Prof Kenneth D Mandl MD, Florence T Bourgeois MD, James W K Lee MD, Andrew Makmur MBBS, David A Hanauer MD, Michael G Semanik MD, Lauren T Kerivan MD, Terra Hill MD, Julian Forero MD, Carlos Restrepo MD, Matteo Vigna MD, Prof Isaac S Kohane MD
Lancet Digital Health, volume 7, issue 1, pages e35-e43. Published 2025-01-01. DOI: 10.1016/S2589-7500(24)00246-2. https://www.sciencedirect.com/science/article/pii/S2589750024002462
Abstract
Background
Patient notes contain substantial information but are difficult for computers to analyse due to their unstructured format. Large-language models (LLMs), such as Generative Pre-trained Transformer 4 (GPT-4), have changed our ability to process text, but we do not know how effectively they handle medical notes. We aimed to assess the ability of GPT-4 to answer predefined questions after reading medical notes in three different languages.
Methods
For this retrospective model-evaluation study, we included eight university hospitals from four countries (ie, the USA, Colombia, Singapore, and Italy). Each site submitted seven de-identified medical notes related to seven separate patients to the coordinating centre between June 1, 2023, and Feb 28, 2024. Medical notes were written between Feb 1, 2020, and June 1, 2023. One site provided medical notes in Spanish, one site provided notes in Italian, and the remaining six sites provided notes in English. We included admission notes, progress notes, and consultation notes. No discharge summaries were included in this study. We advised participating sites to choose medical notes that, at time of hospital admission, were for patients who were male or female, aged 18–65 years, had a diagnosis of obesity, had a diagnosis of COVID-19, and had submitted an admission note. Adherence to these criteria was optional and participating sites randomly chose which medical notes to submit. When entering information into GPT-4, we prepended each medical note with an instruction prompt and a list of 14 questions that had been chosen a priori. Each medical note was individually given to GPT-4 in its original language and in separate sessions; the questions were always given in English. At each site, two physicians independently validated responses by GPT-4 and responded to all 14 questions. Each pair of physicians evaluated responses from GPT-4 to the seven medical notes from their own site only. Physicians were not masked to responses from GPT-4 before providing their own answers, but were masked to responses from the other physician.
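The prompt layout and the agreement tally described in the Methods can be sketched as follows. This is only an illustrative sketch, not the study's code: the abstract does not give the instruction text, the wording of the 14 questions, or how GPT-4 was called, so `build_prompt`, `agreement_category`, `tally`, and all placeholder strings are hypothetical names introduced here.

```python
from collections import Counter


def build_prompt(instruction: str, questions: list[str], note: str) -> str:
    """Prepend an instruction and a numbered question list to one note.

    The layout (instruction, then questions, then note) follows the
    Methods description; the exact separators are an assumption.
    """
    numbered = "\n".join(f"{i}. {q}" for i, q in enumerate(questions, start=1))
    return f"{instruction}\n\nQuestions:\n{numbered}\n\nMedical note:\n{note}"


def agreement_category(physician_a_agrees: bool, physician_b_agrees: bool) -> str:
    """Map two independent physician judgements to the reported categories."""
    n_agree = int(physician_a_agrees) + int(physician_b_agrees)
    return {2: "both", 1: "one", 0: "neither"}[n_agree]


def tally(ratings: list[tuple[bool, bool]]) -> Counter:
    """Count responses per agreement category."""
    return Counter(agreement_category(a, b) for a, b in ratings)


if __name__ == "__main__":
    # Placeholder inputs: the real instruction and questions are not in the abstract.
    questions = [f"Hypothetical question {i}" for i in range(1, 15)]
    prompt = build_prompt("Hypothetical instruction.", questions, "De-identified note text.")
    print(prompt.splitlines()[0])
    print(tally([(True, True), (True, False), (False, False)]))
```

Each note would get its own prompt and its own session, with the questions kept in English regardless of the language of the note.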
Findings
We collected 56 medical notes, of which 42 (75%) were in English, seven (13%) were in Italian, and seven (13%) were in Spanish. For each medical note, GPT-4 responded to 14 questions, resulting in 784 responses. In 622 (79%, 95% CI 76–82) of 784 responses, both physicians agreed with GPT-4. In 82 (11%, 8–13) responses, only one physician agreed with GPT-4. In the remaining 80 (10%, 8–13) responses, neither physician agreed with GPT-4. Both physicians agreed with GPT-4 more often for medical notes written in Spanish (86 [88%, 95% CI 79–93] of 98 responses) and Italian (82 [84%, 75–90] of 98 responses) than in English (454 [77%, 74–80] of 588 responses).
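The proportions and intervals reported above can be approximately reproduced from the counts. The abstract does not say which confidence-interval method was used, so the Wilson score interval below is an assumption, and the computed values may differ from the published ones by a percentage point because of method and rounding. `wilson_interval` is a hypothetical helper written for this sketch.

```python
import math


def wilson_interval(successes: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for ~95%)."""
    p = successes / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half_width, centre + half_width


# Counts taken from the Findings: 56 notes x 14 questions = 784 responses.
counts = {"both physicians agreed": 622, "one physician agreed": 82, "neither agreed": 80}
total = sum(counts.values())

if __name__ == "__main__":
    for label, k in counts.items():
        low, high = wilson_interval(k, total)
        print(f"{label}: {k}/{total} = {k / total:.1%} (95% CI {low:.1%} to {high:.1%})")
```

The same function can be applied to the per-language counts (86 of 98, 82 of 98, and 454 of 588) to compare languages.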
Interpretation
The results of our model-evaluation study suggest that GPT-4 is accurate when analysing medical notes in three different languages. In the future, research should explore how LLMs can be integrated into clinical workflows to maximise their use in health care.
Funding
None.
About the journal
The Lancet Digital Health publishes important, innovative, and practice-changing research on any topic connected with digital technology in clinical medicine, public health, and global health.
The journal’s open access content crosses subject boundaries, building bridges between health professionals and researchers. By bringing together the most important advances in this multidisciplinary field, The Lancet Digital Health is the most prominent publishing venue in digital health.
We publish a range of content types including Articles, Review, Comment, and Correspondence, contributing to promoting digital technologies in health practice worldwide.