Prompt engineering and diagnostic accuracy of multimodal large language models in thyroid fine-needle aspiration cytology.
Authors: Bibhas Saha Dala, Kaushik Mukhopadhyay, Dwaipayan Roy, Souvik Bhattacharya, Indranil Chakrabarti, Santosh Kumar Mondal
Journal: Bioinformation, 21(6), pp. 1317-1323 (Journal Article, eCollection)
Published: 2025-06-30
DOI: https://doi.org/10.6026/97320630021317
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12449510/pdf/
Citations: 0
Abstract
The role of large language models (LLMs) in fine-needle aspiration cytology (FNAC) image analysis remains uncertain. We evaluated two LLMs, ChatGPT-4o (OpenAI) and Claude 3.5 Sonnet (Anthropic), on 63 thyroid FNAC cases, each represented by eight microscopic images (Pap and MGG stains, 10x and 40x), using generic and structured prompts. Structured prompts improved Bethesda concordance and near-match rates, but inter-rater agreement remained poor (κ ≤ 0.09). Specificity reached 100% with structured prompts, but sensitivity dropped to ≤11.8% and misclassification persisted. LLMs show potential, but domain-specific training and validation are necessary before clinical use.
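For readers unfamiliar with the metrics quoted in the abstract, the following is a minimal sketch of how unweighted Cohen's kappa, sensitivity, and specificity are commonly computed. It is not the authors' analysis code: the function names and the toy labels are hypothetical and chosen only for illustration, and the abstract does not say whether the study used weighted or unweighted kappa.

```python
from collections import Counter


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa for two equal-length label sequences."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(rater_a)
    # Observed agreement: fraction of cases where both raters give the same label.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: sum over labels of the product of marginal frequencies.
    count_a, count_b = Counter(rater_a), Counter(rater_b)
    expected = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


def sensitivity_specificity(truth, predicted):
    """Return (sensitivity, specificity) for boolean 'malignant' calls."""
    pairs = list(zip(truth, predicted))
    tp = sum(t and p for t, p in pairs)
    fn = sum(t and not p for t, p in pairs)
    tn = sum(not t and not p for t, p in pairs)
    fp = sum(not t and p for t, p in pairs)
    sens = tp / (tp + fn) if tp + fn else float("nan")
    spec = tn / (tn + fp) if tn + fp else float("nan")
    return sens, spec


# Toy data, invented for illustration only (NOT values from the study).
reference_bethesda = ["II", "II", "II", "IV", "VI", "VI"]
model_bethesda = ["II", "II", "II", "II", "II", "VI"]
kappa = cohen_kappa(reference_bethesda, model_bethesda)

truth_malignant = [True] * 4 + [False] * 6
model_malignant = [True] + [False] * 9
sens, spec = sensitivity_specificity(truth_malignant, model_malignant)

print(f"kappa={kappa:.3f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

The toy binary example shows the pattern the abstract describes: a model that rarely calls a case malignant makes no false-positive calls, so specificity is perfect, but it misses most true malignancies, so sensitivity is low.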