{
  "title": "Prompt engineering and diagnostic accuracy of multimodal large language models in thyroid fine-needle aspiration cytology.",
  "authors": "Bibhas Saha Dala, Kaushik Mukhopadhyay, Dwaipayan Roy, Souvik Bhattacharya, Indranil Chakrabarti, Santosh Kumar Mondal",
  "doi": "10.6026/97320630021317",
  "journal": {"name": "Bioinformation", "volume": "21 6", "pages": "1317-1323"},
  "PeriodicalIF": 1.9,
  "publicationDate": "2025-06-30",
  "publicationTypes": "Journal Article",
  "isOpenAccess": false,
  "openAccessPdf": "https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12449510/pdf/",
  "citationCount": "0",
  "platform": "Semanticscholar",
  "ListUrlMain": "https://doi.org/10.6026/97320630021317",
  "PubModel": "eCollection"
}
Prompt engineering and diagnostic accuracy of multimodal large language models in thyroid fine-needle aspiration cytology.
The role of large language models (LLMs) in fine-needle aspiration cytology (FNAC) image analysis remains uncertain. We evaluated two LLMs, ChatGPT-4o (OpenAI) and Claude 3.5 Sonnet (Anthropic), on 63 thyroid FNAC cases, each represented by eight microscopic images (Pap and MGG stains, 10x/40x), using generic and structured prompts. Structured prompts improved Bethesda concordance and near-match rates, but inter-rater agreement remained poor (κ ≤ 0.09). Specificity reached 100% with structured prompts, but sensitivity dropped to ≤11.8% and misclassification persisted. LLMs show potential, but domain-specific training and validation are necessary before clinical use.
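
The pattern reported above (perfect specificity, very low sensitivity, near-zero κ) is what one would expect when a model assigns almost every case to a benign category. The following minimal sketch shows how the three metrics are computed and why they can diverge. The toy labels are hypothetical and are not the study's data, and treating Bethesda categories V and VI as "malignant" is an assumption made only for this illustration; the abstract does not state how the authors binarized the categories.

```python
from collections import Counter

# Assumption for illustration only: Bethesda V and VI count as "malignant".
MALIGNANT = {"V", "VI"}


def sensitivity_specificity(reference, predicted):
    """Return (sensitivity, specificity) after binarizing Bethesda labels."""
    tp = fn = tn = fp = 0
    for ref, pred in zip(reference, predicted):
        ref_pos, pred_pos = ref in MALIGNANT, pred in MALIGNANT
        if ref_pos and pred_pos:
            tp += 1
        elif ref_pos:
            fn += 1
        elif pred_pos:
            fp += 1
        else:
            tn += 1
    sens = tp / (tp + fn) if (tp + fn) else float("nan")
    spec = tn / (tn + fp) if (tn + fp) else float("nan")
    return sens, spec


def cohen_kappa(rater_a, rater_b):
    """Unweighted Cohen's kappa: (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from the product of the two raters' marginal frequencies.
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)
    if p_e == 1:
        return float("nan")  # undefined when both raters use a single category
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy cases: the "model" calls every case Bethesda II (benign).
reference = ["II", "II", "II", "II", "VI", "VI", "IV", "II"]
model = ["II"] * 8

sens, spec = sensitivity_specificity(reference, model)
kappa = cohen_kappa(reference, model)
print(f"sensitivity={sens:.2f} specificity={spec:.2f} kappa={kappa:.2f}")
```

In this toy case raw agreement is 5 of 8, yet κ is 0 because the same agreement is expected by chance from the marginals alone; specificity is 100% while no malignant case is detected.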