Bosniak classification of renal cysts using large language models: a comparative study.

IF 0.6
Ibrahim Hacibey, Esat Kaba
Journal: Radiologie (Heidelberg, Germany)
Published: 2025-08-24
DOI: 10.1007/s00117-025-01499-x (https://doi.org/10.1007/s00117-025-01499-x)
Publication type: Journal Article
Citations: 0

Abstract


Background: The Bosniak classification system is widely used to assess malignancy risk in renal cystic lesions, yet inter-observer variability poses significant challenges. Large language models (LLMs) may offer a standardized approach to classification when provided with textual descriptions, such as those found in radiology reports.

Objective: This study evaluated the performance of five LLMs, namely GPT‑4 (ChatGPT), Gemini, Copilot, Perplexity, and NotebookLM, in classifying renal cysts based on synthetic textual descriptions mimicking CT report content.

Methods: A synthetic dataset of 100 diagnostic scenarios (20 cases per Bosniak category) was constructed using established radiological criteria. GPT‑4, Gemini, Copilot, and Perplexity were evaluated using zero-shot and few-shot prompting strategies, while NotebookLM was evaluated using retrieval-augmented generation (RAG). Performance metrics included accuracy, sensitivity, and specificity. Statistical significance was assessed using McNemar's and chi-squared tests.
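
The abstract does not state how sensitivity and specificity were derived for a five-category problem. One common convention, assumed here, is to compute them one-vs-rest for each Bosniak category. The sketch below illustrates that convention; the function name `per_class_metrics` and the toy labels are hypothetical and are not the study's data.

```python
# Illustrative sketch only: one-vs-rest metrics for a multi-class task.
# The category list follows the Bosniak scheme; everything else is assumed.
CATEGORIES = ["I", "II", "IIF", "III", "IV"]


def per_class_metrics(y_true, y_pred, categories=CATEGORIES):
    """Return overall accuracy and one-vs-rest sensitivity/specificity per category."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    n = len(y_true)
    pairs = list(zip(y_true, y_pred))
    accuracy = sum(t == p for t, p in pairs) / n

    metrics = {}
    for c in categories:
        tp = sum(t == c and p == c for t, p in pairs)  # category c, predicted c
        fn = sum(t == c and p != c for t, p in pairs)  # category c, missed
        fp = sum(t != c and p == c for t, p in pairs)  # other category, called c
        tn = n - tp - fn - fp                          # other category, not called c
        metrics[c] = {
            "sensitivity": tp / (tp + fn) if tp + fn else float("nan"),
            "specificity": tn / (tn + fp) if tn + fp else float("nan"),
        }
    return accuracy, metrics


# Hypothetical toy labels: one IIF case is misread as II.
toy_true = ["I", "I", "II", "IIF", "III", "IV"]
toy_pred = ["I", "I", "II", "II", "III", "IV"]
toy_accuracy, toy_metrics = per_class_metrics(toy_true, toy_pred)
```

In this toy case the single IIF error lowers IIF sensitivity and II specificity at the same time, which is why a per-category breakdown is more informative than overall accuracy alone.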

Results: GPT‑4 achieved the highest accuracy (87% zero-shot, 99% few-shot), followed by Copilot (81-86%), Gemini (55-69%), and Perplexity (43-69%). NotebookLM, tested only under RAG conditions, reached 87% accuracy. Few-shot learning significantly improved performance (p < 0.05). Classification of Bosniak IIF lesions remained challenging across models.
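
McNemar's test suits a comparison of zero-shot and few-shot prompting because both strategies are applied to the same cases, so only the discordant pairs carry information. The abstract does not say which variant of the test was used; the sketch below assumes the exact (binomial) two-sided form. The function name `mcnemar_exact` and the example outcomes are hypothetical, not the study's results.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Exact two-sided McNemar test on paired correct/incorrect outcomes.

    correct_a and correct_b are equal-length sequences of booleans, one entry
    per case, for two strategies evaluated on the same cases.
    Returns (b, c, p_value), where b counts cases only strategy A got right
    and c counts cases only strategy B got right.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("paired sequences must have the same length")
    b = sum(a and not bb for a, bb in zip(correct_a, correct_b))
    c = sum(bb and not a for a, bb in zip(correct_a, correct_b))
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs: no evidence of a difference
    k = min(b, c)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Hypothetical outcomes for 18 cases: few-shot fixes 8 zero-shot errors
# and introduces none.
zero_shot_correct = [True] * 10 + [False] * 8
few_shot_correct = [True] * 18
b, c, p_value = mcnemar_exact(zero_shot_correct, few_shot_correct)
```

With 8 discordant pairs all favoring one strategy, the two-sided p-value is 2 × (1/2)^8 ≈ 0.0078.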

Conclusion: When provided with well-structured textual descriptions, LLMs can accurately classify renal cysts. Few-shot prompting significantly enhances performance. However, persistent difficulties in classifying borderline lesions such as Bosniak IIF highlight the need for further refinement and real-world validation.
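
The difference between zero-shot and few-shot prompting can be sketched as a prompt builder: a zero-shot prompt contains only the instruction and the case to classify, whereas a few-shot prompt also includes labeled examples. The wording, the function name `build_prompt`, and the example descriptions below are hypothetical illustrations; the study's actual prompts are not given in the abstract.

```python
BOSNIAK_CATEGORIES = ("I", "II", "IIF", "III", "IV")


def build_prompt(description, examples=()):
    """Build a classification prompt.

    With no examples this is a zero-shot prompt; with (description, label)
    pairs in `examples` it becomes a few-shot prompt.
    """
    lines = [
        "Classify the renal cystic lesion described below into exactly one "
        "Bosniak category: " + ", ".join(BOSNIAK_CATEGORIES) + ".",
        "Answer with the category only.",
    ]
    for example_description, example_label in examples:
        if example_label not in BOSNIAK_CATEGORIES:
            raise ValueError(f"unknown Bosniak category: {example_label}")
        lines += ["", f"Description: {example_description}", f"Category: {example_label}"]
    # The case to classify always comes last, with the label left open.
    lines += ["", f"Description: {description}", "Category:"]
    return "\n".join(lines)


# Hypothetical labeled examples for the few-shot variant.
few_shot_examples = [
    ("Thin-walled cyst of water attenuation, no septa, no calcification, no enhancement.", "I"),
    ("Cystic mass with an enhancing soft-tissue nodule.", "IV"),
]
case = "Cyst with multiple thin septa and minimal smooth wall thickening."
zero_shot_prompt = build_prompt(case)
few_shot_prompt = build_prompt(case, few_shot_examples)
```

Sending either prompt to a model and parsing the reply is deliberately left out, since each of the evaluated tools exposes a different interface.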
