[Evaluating the accuracy of large language models in answering mammography screening questions in Italian and English: a study based on the Eusobi guidelines.]
{"title":"[Evaluating the accuracy of large language models in answering mammography screening questions in Italian and English: a study based on the Eusobi guidelines.]","authors":"Manuel Signorini, Silvia Fontani, Paola Minichetti, Silvia Teggi, Alessandra Barusco, Massimo Favat","doi":"10.1701/4460.44556","DOIUrl":null,"url":null,"abstract":"<p><strong>Introduction: </strong>Artificial intelligence (AI) is transforming various aspects of everyday life, including healthcare, through large language models (LLMs) like ChatGPT, Gemini, and Copilot. These systems are increasingly used to disseminate medical information, allowing patients to access simplified explanations. This study aims to compare responses to breast imaging-related questions formulated in Italian and English, based on Eusobi guidelines, evaluating the LLMs' ability to provide accurate and complete answers on mammography screening concepts.</p><p><strong>Materials and methods: </strong>Nine questions related to breast cancer screening were developed by five breast radiologists based on Eusobi recommendations. These questions were submitted to ChatGPT, Gemini, and Copilot in both Italian and English. Responses were evaluated by two expert breast radiologists using a Likert scale (1 to 5), with statistical analysis performed to compare the accuracy, average length of responses, use of radiological sources and the agreement among readers.</p><p><strong>Results: </strong>The average scores for responses were similar in both languages, ranging from 3.6 to 4 out of 5. Questions on general mammography concepts received more accurate answers, while more specific questions based on the latest guidelines showed incomplete responses, especially about the definition of dense breast. The sources used, particularly in Italian, were often non-specialized in radiology, highlighting a limitation of LLMs in providing detailed and up-to-date medical answers.</p><p><strong>Conclusions: </strong>The study shows that LLMs are useful tools for medical communication, but they have limitations in delivering accurate answers on highly specialized medical topics. To improve the quality of information, collaboration between AI experts and healthcare professionals is necessary, especially in breast cancer prevention and screening.</p>","PeriodicalId":20887,"journal":{"name":"Recenti progressi in medicina","volume":"116 3","pages":"162-167"},"PeriodicalIF":0.0000,"publicationDate":"2025-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Recenti progressi in medicina","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1701/4460.44556","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"Medicine","Score":null,"Total":0}
引用次数: 0
Abstract
Introduction: Artificial intelligence (AI) is transforming various aspects of everyday life, including healthcare, through large language models (LLMs) such as ChatGPT, Gemini, and Copilot. These systems are increasingly used to disseminate medical information, allowing patients to access simplified explanations. This study compares responses to breast imaging-related questions formulated in Italian and English, based on the EUSOBI guidelines, evaluating the LLMs' ability to provide accurate and complete answers on mammography screening concepts.
Materials and methods: Nine questions on breast cancer screening were developed by five breast radiologists based on the EUSOBI recommendations. These questions were submitted to ChatGPT, Gemini, and Copilot in both Italian and English. Responses were evaluated by two expert breast radiologists using a Likert scale (1 to 5), with statistical analysis performed to compare accuracy, average response length, use of radiological sources, and inter-reader agreement.
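A minimal sketch of the kind of analysis described above. The abstract does not specify which statistics were used, so the Likert ratings below are hypothetical and weighted Cohen's kappa is assumed as the inter-reader agreement measure for the two readers.

```python
# Illustrative only: hypothetical Likert ratings (1-5) for nine questions,
# mean score and inter-reader agreement for two readers.
import numpy as np
from sklearn.metrics import cohen_kappa_score

reader_1 = np.array([4, 5, 4, 3, 4, 5, 3, 4, 4])  # hypothetical scores, reader 1
reader_2 = np.array([4, 4, 4, 3, 5, 5, 3, 4, 3])  # hypothetical scores, reader 2

# Average score across both readers, as reported per model/language in the study
mean_score = (reader_1.mean() + reader_2.mean()) / 2
print(f"Mean Likert score: {mean_score:.2f}")

# Weighted kappa accounts for the ordinal nature of the Likert levels
kappa = cohen_kappa_score(reader_1, reader_2, weights="quadratic")
print(f"Weighted Cohen's kappa: {kappa:.2f}")
```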
Results: The average scores for responses were similar in both languages, ranging from 3.6 to 4 out of 5. Questions on general mammography concepts received more accurate answers, while more specific questions based on the latest guidelines received incomplete responses, particularly regarding the definition of dense breast. The sources cited, especially for the Italian responses, were often not specialized in radiology, highlighting a limitation of LLMs in providing detailed and up-to-date medical answers.
Conclusions: The study shows that LLMs are useful tools for medical communication, but they have limitations in delivering accurate answers on highly specialized medical topics. To improve the quality of information, collaboration between AI experts and healthcare professionals is necessary, especially in breast cancer prevention and screening.
Journal description
Now in its sixtieth year, Recenti Progressi in Medicina continues to be a reliable point of reference and an essential working tool for broadening the cultural horizon of the Italian physician. Recenti Progressi in Medicina is an internal medicine journal. This means recovering a global and integrated perspective, suited to avoiding both the particularism of specialist information and the fragmentation of generalist information.