Iason Psilopatis, Cécile Monod, Valeria Filippi, Rebecca Tschudin, Olaf Lapaire, Julius Emons, Beatrice Mosimann, Tibor A Zwimpfer
Archives of Gynecology and Obstetrics. Published 2025-08-21. DOI: 10.1007/s00404-025-08145-w (https://doi.org/10.1007/s00404-025-08145-w)
A comparative evaluation of publicly available large language models in the assessment of CTG traces according to the FIGO criteria.
Background: Cardiotocography (CTG) remains a cornerstone in fetal monitoring, but its interpretation is subject to considerable inter- and intra-observer variability. Artificial intelligence (AI) tools, particularly large language models (LLMs), offer potential to improve diagnostic consistency and reduce clinician workload.
Objectives: This study aims to evaluate and compare the accuracy of various LLMs in CTG interpretation based on the International Federation of Gynecology and Obstetrics (FIGO) 2015 criteria.
Study design: Sixty CTG traces previously classified by clinicians at the University Hospital Basel according to FIGO guidelines were analyzed. In a two-run protocol, 30 normal CTG traces were first presented as screenshots to Chat-GPT-4.0, Google Gemini, Bing Copilot, and DeepSeek. The LLMs that demonstrated adequate interpretation of normal CTGs were then tasked with classifying another 30 suspicious or pathological CTG traces. Each LLM was asked to classify each CTG trace as normal or abnormal.
Results: DeepSeek was unable to interpret CTGs and was excluded. Google Gemini showed poor performance (6.7%) on normal CTGs. Chat-GPT-4.0 partially succeeded in correctly classifying the provided CTG traces as normal (46.7%) or abnormal (50%). Bing Copilot accurately interpreted normal CTGs (96.6%) but failed on abnormal ones (0%).
Conclusions: LLMs show major limitations in the interpretation of CTG traces according to the FIGO criteria.
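The percentages in the Results can be related back to the 30-trace runs described in the study design. The following is a minimal sketch, not the authors' analysis: the per-model counts are not reported in the abstract and are back-calculated here as an assumption, and the combined score (the mean of the two per-class rates, which equals overall accuracy when both runs contain 30 traces) is an illustrative addition. The names `reported`, `implied_count`, and `summarize` are hypothetical.

```python
# Denominator taken from the study design: 30 normal and 30 abnormal traces.
N_PER_RUN = 30

# Percent of traces classified correctly, as stated in the Results.
# Google Gemini has no "abnormal" entry because the abstract reports none.
reported = {
    "Google Gemini": {"normal": 6.7},
    "Chat-GPT-4.0": {"normal": 46.7, "abnormal": 50.0},
    "Bing Copilot": {"normal": 96.6, "abnormal": 0.0},
}


def implied_count(percent, n=N_PER_RUN):
    """Back-calculate the nearest whole number of correct traces (assumption)."""
    return round(percent / 100 * n)


def summarize(rates_by_model):
    """Return implied counts and, where both runs exist, the mean per-class rate."""
    rows = {}
    for model, rates in rates_by_model.items():
        counts = {label: implied_count(p) for label, p in rates.items()}
        mean_rate = None
        if "normal" in rates and "abnormal" in rates:
            # Mean of the correct-classification rates on normal and abnormal traces.
            mean_rate = (rates["normal"] + rates["abnormal"]) / 2
        rows[model] = {"counts": counts, "mean_class_rate": mean_rate}
    return rows


if __name__ == "__main__":
    for model, row in summarize(reported).items():
        print(model, row)
```

Under these assumptions, a model that labels nearly every trace as normal scores close to 50% on the combined measure despite its high rate on normal traces, which is consistent with the conclusion that the LLMs show major limitations.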
About the journal:
Founded in 1870 as "Archiv für Gynaekologie", Archives of Gynecology and Obstetrics has a long and outstanding tradition. Since 1922 the journal has been the Organ of the Deutsche Gesellschaft für Gynäkologie und Geburtshilfe. "The Archives of Gynecology and Obstetrics" is circulated in over 40 countries worldwide and is indexed in "PubMed/Medline" and "Science Citation Index Expanded/Journal Citation Report".
The journal publishes invited and submitted reviews; peer-reviewed original articles about clinical topics and basic research as well as news and views and guidelines and position statements from all sub-specialties in gynecology and obstetrics.