Evaluation of general-purpose large language models as diagnostic support tools in cervical cytology.

IF 3.2 | Zone 4 (Medicine) | Q2 (Pathology)
Pathology, Research and Practice | Pub Date: 2025-10-01 | Epub Date: 2025-08-07 | DOI: 10.1016/j.prp.2025.156159
Thiyaphat Laohawetwanit, Sompon Apornvirat, Aleksandra Asaturova, Hua Li, Kris Lami, Andrey Bychkov
{"title":"Evaluation of general-purpose large language models as diagnostic support tools in cervical cytology.","authors":"Thiyaphat Laohawetwanit, Sompon Apornvirat, Aleksandra Asaturova, Hua Li, Kris Lami, Andrey Bychkov","doi":"10.1016/j.prp.2025.156159","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>The application of general-purpose large language models (LLMs) in cytopathology remains largely unexplored. This study aims to evaluate the accuracy and consistency of a custom version of ChatGPT-4 (GPT), ChatGPT o3, and Gemini 2.5 Pro as diagnostic support tools for cervical cytology.</p><p><strong>Materials and methods: </strong>A total of 200 Papanicolaou-stained cervical cytology images were acquired at 40x magnification, each measuring 384 × 384 pixels. These images consisted of 100 cases classified as negative for intraepithelial lesion or malignancy (NILM) and 100 cases across various abnormal categories: 20 low-grade squamous intraepithelial lesion (LSIL), 20 high-grade squamous intraepithelial lesion (HSIL), 20 squamous cell carcinoma (SCC), 20 adenocarcinoma in situ (AIS), and 20 adenocarcinoma (ADC). Diagnostic accuracy and consistency were evaluated by submitting each image to a GPT, ChatGPT o3, and Gemini 2.5 Pro 5-10 times.</p><p><strong>Results: </strong>When distinguishing normal from abnormal cytology, LLMs showed mean sensitivity between 85.4 % and 100 %, and specificity between 67.2 % and 92.7 %. ChatGPT o3 was more accurate in identifying NILM (mean 89.2 % vs. 67.2 %) but less accurate in detecting LSIL (34 % vs. 85 %), HSIL (6 % vs. 63 %), and ADC (28 % vs. 91 %). Chain-of-thought prompting and submitting multiple images of the same diagnosis to ChatGPT o3 and Gemini 2.5 Pro did not significantly improve accuracy. Both models also performed poorly in identifying cervicovaginal infections.</p><p><strong>Conclusions: </strong>ChatGPT o3 and Gemini 2.5 Pro demonstrated complementary strengths in cervical cytology. Due to their low accuracy and inconsistency in abnormal cytology, general-purpose LLMs are not recommended as diagnostic support tools in cervical cytology.</p>","PeriodicalId":19916,"journal":{"name":"Pathology, research and practice","volume":"274 ","pages":"156159"},"PeriodicalIF":3.2000,"publicationDate":"2025-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Pathology, research and practice","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1016/j.prp.2025.156159","RegionNum":4,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2025/8/7 0:00:00","PubModel":"Epub","JCR":"Q2","JCRName":"PATHOLOGY","Score":null,"Total":0}
Citations: 0

Abstract

Introduction: The application of general-purpose large language models (LLMs) in cytopathology remains largely unexplored. This study aims to evaluate the accuracy and consistency of a custom version of ChatGPT-4 (GPT), ChatGPT o3, and Gemini 2.5 Pro as diagnostic support tools for cervical cytology.

Materials and methods: A total of 200 Papanicolaou-stained cervical cytology images were acquired at 40× magnification, each measuring 384 × 384 pixels. These images consisted of 100 cases classified as negative for intraepithelial lesion or malignancy (NILM) and 100 cases across various abnormal categories: 20 low-grade squamous intraepithelial lesion (LSIL), 20 high-grade squamous intraepithelial lesion (HSIL), 20 squamous cell carcinoma (SCC), 20 adenocarcinoma in situ (AIS), and 20 adenocarcinoma (ADC). Diagnostic accuracy and consistency were evaluated by submitting each image to GPT, ChatGPT o3, and Gemini 2.5 Pro 5-10 times.
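
The abstract does not include code, but the repeated-query evaluation protocol it describes can be sketched roughly as follows (Python). The `query_model` wrapper, prompt wording, file layout, and fixed repeat count are illustrative assumptions, not the authors' actual implementation.

```python
import base64
from collections import Counter
from pathlib import Path

# Bethesda-style labels used in the study; everything else in this sketch is assumed.
CATEGORIES = ["NILM", "LSIL", "HSIL", "SCC", "AIS", "ADC"]
N_REPEATS = 5  # the study submitted each image 5-10 times per model


def encode_image(path: Path) -> str:
    """Base64-encode an image so it can be attached to a vision-capable LLM request."""
    return base64.b64encode(path.read_bytes()).decode("utf-8")


def query_model(model_name: str, image_b64: str, prompt: str) -> str:
    """Hypothetical wrapper around a vendor vision-LLM API; should return one CATEGORIES label."""
    raise NotImplementedError("plug in the vendor SDK call for GPT, ChatGPT o3, or Gemini here")


def evaluate(image_dir: Path, ground_truth: dict[str, str], model_name: str) -> tuple[float, float]:
    """Submit every image N_REPEATS times; return (majority-vote accuracy, run-to-run consistency)."""
    prompt = ("Classify this Papanicolaou-stained cervical cytology image as one of: "
              + ", ".join(CATEGORIES) + ". Answer with the label only.")
    correct = consistent = total = 0
    for path in sorted(image_dir.glob("*.png")):
        replies = [query_model(model_name, encode_image(path), prompt) for _ in range(N_REPEATS)]
        majority, majority_count = Counter(replies).most_common(1)[0]
        correct += majority == ground_truth[path.name]   # accuracy of the majority answer
        consistent += majority_count == N_REPEATS        # all repeats agreed
        total += 1
    return correct / total, consistent / total
```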

Results: When distinguishing normal from abnormal cytology, LLMs showed mean sensitivity between 85.4% and 100%, and specificity between 67.2% and 92.7%. ChatGPT o3 was more accurate in identifying NILM (mean 89.2% vs. 67.2%) but less accurate in detecting LSIL (34% vs. 85%), HSIL (6% vs. 63%), and ADC (28% vs. 91%). Chain-of-thought prompting and submitting multiple images of the same diagnosis to ChatGPT o3 and Gemini 2.5 Pro did not significantly improve accuracy. Both models also performed poorly in identifying cervicovaginal infections.
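
For the binary normal-versus-abnormal comparison reported here, sensitivity and specificity are simple ratios over the 100 abnormal and 100 NILM images. A minimal sketch with made-up counts (not figures from the paper):

```python
def sensitivity_specificity(tp: int, fn: int, tn: int, fp: int) -> tuple[float, float]:
    """Sensitivity = TP / (TP + FN) over abnormal images; specificity = TN / (TN + FP) over NILM images."""
    return tp / (tp + fn), tn / (tn + fp)

# Illustrative counts only: a model that flags 90 of 100 abnormal images and
# correctly calls 80 of 100 NILM images negative.
sens, spec = sensitivity_specificity(tp=90, fn=10, tn=80, fp=20)
print(f"sensitivity={sens:.1%}, specificity={spec:.1%}")  # sensitivity=90.0%, specificity=80.0%
```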

Conclusions: ChatGPT o3 and Gemini 2.5 Pro demonstrated complementary strengths in cervical cytology. However, given their low accuracy and inconsistency on abnormal cytology, general-purpose LLMs are not recommended as diagnostic support tools in cervical cytology.

Source journal
CiteScore: 5.00
Self-citation rate: 3.60%
Articles published: 405
Review turnaround: 24 days
Journal description: Pathology, Research and Practice provides accessible coverage of the most recent developments across the entire field of pathology: Reviews focus on recent progress in pathology, while Comments look at interesting current problems and at hypotheses for future developments in pathology. Original Papers present novel findings on all aspects of general, anatomic and molecular pathology. Rapid Communications inform readers on preliminary findings that may be relevant for further studies and need to be communicated quickly. Teaching Cases look at new aspects or special diagnostic problems of diseases and at case reports relevant for the pathologist's practice.