Prompt engineering and diagnostic accuracy of multimodal large language models in thyroid fine-needle aspiration cytology.

IF 1.9

Bioinformation. Pub Date: 2025-06-30. eCollection Date: 2025-01-01. DOI: 10.6026/97320630021317

Bibhas Saha Dala, Kaushik Mukhopadhyay, Dwaipayan Roy, Souvik Bhattacharya, Indranil Chakrabarti, Santosh Kumar Mondal

Citations: 0

Abstract

The role of large language models (LLMs) in fine-needle aspiration cytology (FNAC) image analysis remains uncertain. We evaluated two multimodal LLMs, ChatGPT-4o (OpenAI) and Claude 3.5 Sonnet (Anthropic), on 63 thyroid FNAC cases, each represented by eight microscopic images (Pap and MGG stains, 10x/40x magnification), using generic and structured prompts. Structured prompts improved Bethesda concordance and near-match rates, but inter-rater agreement remained poor (κ ≤ 0.09). Specificity reached 100% with structured prompts, but sensitivity dropped to ≤11.8% and misclassification persisted. LLMs show potential, but domain-specific training and validation are necessary before clinical use.
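For readers less familiar with the metrics quoted above, the following is a minimal sketch of how agreement (κ), sensitivity, and specificity are typically computed. It assumes unweighted Cohen's kappa and a simple binary malignant/non-malignant split; the abstract does not state which kappa variant or which binarization the authors used. The function names and the toy labels are hypothetical and are not data from the study.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("label sequences must be non-empty and equal length")
    n = len(rater_a)
    # Observed agreement: fraction of cases where both labels match.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of marginal frequencies, summed over categories.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical category throughout
    return (observed - expected) / (1.0 - expected)


def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for boolean labels (True = positive)."""
    tp = sum(t and p for t, p in zip(truth, predicted))
    fn = sum(t and not p for t, p in zip(truth, predicted))
    tn = sum(not t and not p for t, p in zip(truth, predicted))
    fp = sum(not t and p for t, p in zip(truth, predicted))
    sensitivity = tp / (tp + fn) if (tp + fn) else None
    specificity = tn / (tn + fp) if (tn + fp) else None
    return sensitivity, specificity


# Hypothetical toy labels (NOT study data): Bethesda categories per case.
reference = ["II", "II", "II", "IV", "V", "VI"]
model = ["II", "II", "III", "II", "II", "II"]
kappa = cohen_kappa(reference, model)

# Hypothetical binary labels: True = malignant on the reference diagnosis.
truth = [True, True, True, True, False, False, False, False, False, False]
predicted = [True, False, False, False, False, False, False, False, False, False]
sens, spec = sensitivity_specificity(truth, predicted)
```

In the toy binary example the model rarely calls a case malignant, so it makes no false-positive calls (perfect specificity) while missing most true positives (low sensitivity). This is the same qualitative pattern the abstract reports for structured prompts.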


Source journal: Bioinformation (Mathematical & Computational Biology)

Self-citation rate: 0.00%

Articles published: 128