Olivia Riccomi, Francesco Andrea Causio, Vittorio De Vita, Antonio Cristiano, Manuel Del Medico, Lorenzo De Mori, Chiara Battipaglia, Melissa Sawaya, Luigi De Angelis, Marcello Di Pumpo, Alessandra Piscitelli, Pietro Eric Risuleo, Giulia Vojvodic, Bianca Destro Castaniti, Nicolò Scarsi
Title: One-shot evaluation of Mistral7B on the new EuropeMedQA benchmark (original title: Valutazione one-shot di Mistral7B sul nuovo benchmark EuropeMedQA).
Journal: Recenti progressi in medicina, 116(10), pp. 619-620. Journal Article. Published 2025-10-01.
DOI: 10.1701/4573.45804 (https://doi.org/10.1701/4573.45804)
Abstract
Artificial intelligence (AI) adoption in healthcare is rising. Unbiased evaluation requires uncontaminated benchmarks. We evaluated Mistral-7B-Instruct-v0.1 on 1120 human-validated Italian medical multiple-choice questions (SSM). Mistral achieved 40.2% accuracy and an F1 score of 38.8% on the dataset. Likely causes of this low performance include English-centric instruction tuning, lack of medical domain knowledge, and prompt misalignment with the task format. These findings suggest that LLMs need further improvements before deployment.
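The two metrics reported above can be illustrated with a minimal sketch. The abstract does not say how the F1 score was averaged or how answers were extracted from the model output, so the macro-averaging over answer options, the option labels, and the function names `accuracy` and `macro_f1` are assumptions made for illustration only, not the authors' evaluation code.

```python
from collections import Counter


def accuracy(gold, pred):
    """Fraction of questions where the predicted option equals the gold option."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def macro_f1(gold, pred, labels):
    """Unweighted mean of per-option F1 scores (macro-averaging is an assumption)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # option p was predicted but was not the gold answer
            fn[g] += 1  # gold option g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the option never occurs
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy data with hypothetical option labels, not taken from the benchmark.
gold = ["A", "B", "C", "A"]
pred = ["A", "B", "A", "C"]
print(accuracy(gold, pred))
print(macro_f1(gold, pred, labels=["A", "B", "C"]))
```

Accuracy and macro F1 can diverge when the model favours some answer options over others, which is one reason a paper may report both.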
About the journal:
Now in its sixtieth year, Recenti Progressi in Medicina remains a reliable point of reference and an essential working tool for broadening the cultural horizons of Italian physicians. Recenti Progressi in Medicina is a journal of internal medicine. This means recovering a global, integrated perspective that avoids both the narrowness of specialist information and the fragmentation of generalist information.