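The PERFORM Study: Artificial Intelligence Versus Human Residents in Cross-Sectional Obstetrics-Gynecology Scenarios Across Languages and Time Constraints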

Canio Martinelli MD, Antonio Giordano MD, Vincenzo Carnevale PhD, Sharon Raffaella Burk PhD, Lavinia Porto MD, Giuseppe Vizzielli MD, Alfredo Ercoli MD
DOI: 10.1016/j.mcpdig.2025.100206
Journal: Mayo Clinic Proceedings. Digital health, 3(2), Article 100206
Publication date: 2025-03-08
Publication type: Journal Article
Source: https://www.sciencedirect.com/science/article/pii/S2949761225000136
Citations: 0

Abstract

The PERFORM Study: Artificial Intelligence Versus Human Residents in Cross-Sectional Obstetrics-Gynecology Scenarios Across Languages and Time Constraints

Objective

To systematically evaluate the performance of artificial intelligence (AI) large language models (LLMs) compared with obstetrics-gynecology residents in clinical decision-making, examining diagnostic accuracy and error patterns across linguistic domains, time constraints, and experience levels.

Patients and Methods

In this cross-sectional study, we evaluated 8 AI LLMs and 24 obstetrics-gynecology residents (years 1-5) using 60 standardized clinical scenarios. Most AI LLMs and all residents were assessed in May 2024, whereas ChatGPT o1-preview, GPT-4o, and Claude 3.5 Sonnet were evaluated in November 2024. The assessment framework incorporated English and Italian scenarios under both timed and untimed conditions, along with systematic error pattern analysis. The primary outcome was diagnostic accuracy; secondary end points included AI system stratification, resident progression, language impact, time pressure effects, and integration potential.

Results

The AI LLMs achieved higher overall accuracy (73.75%; 95% confidence interval [CI], 69.64%-77.49%) than residents (65.35%; 95% CI, 62.85%-67.76%; P<.001). High-performing AI systems (ChatGPT o1-preview, GPT-4o, and Claude 3.5 Sonnet) achieved consistently high cross-linguistic accuracy (88.33%) with minimal language impact (6.67%±0.00%). Resident performance declined significantly under time constraints (from 73.2% to 56.5% adjusted accuracy; Cohen’s d=1.009; P<.001), whereas AI systems showed less deterioration. Error pattern analysis indicated a moderate correlation between AI and human reasoning (r=0.666; P<.001). Residents exhibited systematic progression from year 1 (44.7%) to year 5 (87.1%). Integration analysis found variable benefits across training levels, with maximum enhancement in early-career residents (+29.7%; P<.001).
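
The confidence intervals quoted for overall accuracy can be sanity-checked with a short calculation. The sketch below is not the authors' analysis: the abstract states neither the interval method nor the pooled denominators. It assumes a Wilson score interval and assumes every model and every resident answered each of the 60 scenarios once (8 × 60 = 480 and 24 × 60 = 1440 responses). The function name `wilson_interval` and the `groups` table are illustrative.

```python
from math import sqrt
from statistics import NormalDist


def wilson_interval(successes: int, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if n <= 0:
        raise ValueError("n must be positive")
    # Two-sided critical value of the standard normal (about 1.96 for 95%).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    p = successes / n
    denom = 1 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half_width = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return centre - half_width, centre + half_width


# ASSUMPTION: pooled denominators are raters x scenarios; the abstract does not state them.
SCENARIOS = 60
groups = {
    "AI LLMs": (0.7375, 8 * SCENARIOS),     # reported accuracy, assumed n = 480
    "Residents": (0.6535, 24 * SCENARIOS),  # reported accuracy, assumed n = 1440
}

for name, (accuracy, n) in groups.items():
    correct = round(accuracy * n)  # implied number of correct answers
    low, high = wilson_interval(correct, n)
    print(f"{name}: {correct}/{n} correct, 95% CI {low:.2%} to {high:.2%}")
```

Under these assumptions the computed bounds round to the intervals reported above (69.64%-77.49% and 62.85%-67.76%), which is consistent with, but does not confirm, a Wilson interval on pooled responses.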

Conclusion

High-performing AI LLMs showed strong diagnostic accuracy and resilience under linguistic and temporal pressures. These findings suggest that AI-enhanced decision-making may offer particular benefits in obstetrics and gynecology training programs, especially for junior residents, by improving diagnostic consistency and potentially reducing cognitive load in time-sensitive clinical settings.
Source journal: Mayo Clinic Proceedings. Digital health
Subject areas: Medicine and Dentistry (General), Health Informatics, Public Health and Health Policy
Self-citation rate: 0.00%
Articles published: 0
Review time: 47 days