A comparative evaluation of publicly available large language models in the assessment of CTG traces according to the FIGO criteria.

IF 2.5 · CAS Zone 3 (Medicine) · JCR Q2, Obstetrics & Gynecology

Archives of Gynecology and Obstetrics Pub Date : 2025-08-21 DOI:10.1007/s00404-025-08145-w

Iason Psilopatis, Cécile Monod, Valeria Filippi, Rebecca Tschudin, Olaf Lapaire, Julius Emons, Beatrice Mosimann, Tibor A Zwimpfer


Citations: 0

Abstract


A comparative evaluation of publicly available large language models in the assessment of CTG traces according to the FIGO criteria.

Background: Cardiotocography (CTG) remains a cornerstone in fetal monitoring, but its interpretation is subject to considerable inter- and intra-observer variability. Artificial intelligence (AI) tools, particularly large language models (LLMs), offer potential to improve diagnostic consistency and reduce clinician workload.

Objectives: This study aims to evaluate and compare the accuracy of various LLMs in CTG interpretation based on the International Federation of Gynecology and Obstetrics (FIGO 2015) criteria.

Study design: Sixty CTG traces previously classified by clinicians at the University Hospital Basel according to FIGO guidelines were analyzed. In a two-run protocol, 30 normal CTG traces were first presented as screenshots to Chat-GPT-4.0, Google Gemini, Bing Copilot, and DeepSeek. The LLMs that demonstrated adequate interpretation of normal CTGs were then asked to classify another 30 suspicious or pathological CTG traces. Each LLM was asked to classify each CTG trace as normal or abnormal.

Results: DeepSeek was unable to interpret CTGs and was excluded. Google Gemini showed poor performance (6.7%) on normal CTGs. Chat-GPT-4.0 partially succeeded in correctly classifying the provided CTG traces as normal (46.7%) or abnormal (50%). Bing Copilot accurately interpreted normal CTGs (96.6%) but failed on abnormal ones (0%).
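The reported percentages can be put on a common footing with a little arithmetic. The following is a minimal sketch, not part of the study: it assumes 30 traces per run (taken from the study design), infers whole-trace counts by rounding, and treats "abnormal" as the positive class so that the abnormal-trace rate plays the role of sensitivity and the normal-trace rate that of specificity. The names `reported`, `implied_count`, and `summarize` are hypothetical.

```python
# Percentages as reported in the abstract. The denominator of 30 traces per
# run is an ASSUMPTION taken from the study design, not a published count.
TRACES_PER_RUN = 30

reported = {
    "Chat-GPT-4.0": {"normal": 46.7, "abnormal": 50.0},
    "Bing Copilot": {"normal": 96.6, "abnormal": 0.0},
}


def implied_count(percent, n=TRACES_PER_RUN):
    """Nearest whole number of traces consistent with a reported percentage."""
    return round(percent / 100 * n)


def summarize(rates):
    """Inferred counts plus balanced accuracy (mean of the two class rates)."""
    specificity = rates["normal"]    # % of normal traces labelled normal
    sensitivity = rates["abnormal"]  # % of abnormal traces labelled abnormal
    return {
        "normal_correct": implied_count(rates["normal"]),
        "abnormal_correct": implied_count(rates["abnormal"]),
        "balanced_accuracy_pct": round((sensitivity + specificity) / 2, 2),
    }


summary = {model: summarize(rates) for model, rates in reported.items()}

for model, stats in summary.items():
    print(model, stats)
```

Under these assumptions, both models land slightly below 50% balanced accuracy, which is what a classifier that ignores the trace would achieve on two equally sized classes. This is consistent with the conclusion below, although the inferred counts should be checked against the full paper.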

Conclusions: LLMs show major limitations in the interpretation of CTG traces according to the FIGO criteria.


Source journal

Archives of Gynecology and Obstetrics (Medicine: Obstetrics & Gynecology)

CiteScore: 4.70

Self-citation rate: 15.40%

Articles published: 493

Review time: 1 month

About the journal: Founded in 1870 as "Archiv für Gynaekologie", Archives of Gynecology and Obstetrics has a long and outstanding tradition. Since 1922 the journal has been the Organ of the Deutsche Gesellschaft für Gynäkologie und Geburtshilfe. "The Archives of Gynecology and Obstetrics" is circulated in over 40 countries worldwide and is indexed in "PubMed/Medline" and "Science Citation Index Expanded/Journal Citation Report". The journal publishes invited and submitted reviews, peer-reviewed original articles on clinical topics and basic research, news and views, and guidelines and position statements from all sub-specialties in gynecology and obstetrics.