Benchmarking open-source large language models on Portuguese Revalida multiple-choice questions.

João Victor Bruneti Severino, Pedro Angelo Basei de Paula, Matheus Nespolo Berger, Filipe Silveira Loures, Solano Amadori Todeschini, Eduardo Augusto Roeder, Maria Han Veiga, Murilo Guedes, Gustavo Lenci Marques

BMJ Health & Care Informatics, 32(1), published 2025-02-24. DOI: 10.1136/bmjhci-2024-101195. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12082654/pdf/
Objective: The study aimed to evaluate the performance of leading large language models (LLMs) on a validated medical knowledge test in Portuguese.
Methods: This study compared 31 LLMs on Revalida, the Brazilian national medical licensing examination, evaluating 23 open-source and 8 proprietary models across 399 multiple-choice questions.
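As a rough illustration of this kind of evaluation, the sketch below scores single-letter answers against a gold key. The `accuracy` function, the answer format, and the handling of incoherent outputs are assumptions for illustration, not the authors' published pipeline.

```python
# Hypothetical sketch (not the authors' code) of how accuracy on
# multiple-choice questions could be scored.

def accuracy(predictions: list[str], answer_key: list[str]) -> float:
    """Fraction of questions where the model's letter matches the key.

    Outputs that are not a single A-E letter (incoherent answers)
    simply count as wrong.
    """
    valid = {"A", "B", "C", "D", "E"}
    correct = 0
    for pred, gold in zip(predictions, answer_key, strict=True):
        letter = pred.strip().upper()
        if letter in valid and letter == gold:
            correct += 1
    return correct / len(answer_key)

# Toy example with a hypothetical 5-question answer key:
key = ["A", "C", "B", "D", "A"]
model_output = ["A", "C", "E", "D", "I am not sure"]  # last reply is incoherent
print(f"Success rate: {accuracy(model_output, key):.1%}")  # -> Success rate: 60.0%
```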
Results: Among the smaller models, Llama 3 8B exhibited the highest success rate at 53.9%, while the medium-sized Mixtral 8×7B attained 63.7%. Larger models such as Llama 3 70B reached 77.5%. Among the proprietary models, GPT-4o and Claude Opus were the most accurate, scoring 86.8% and 83.8%, respectively.
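For scale, the reported rates translate into approximate question counts out of the 399 items; this is a back-of-the-envelope check assuming every question was scored for every model, not a figure from the paper.

```python
# Back-of-the-envelope: convert reported success rates into approximate
# question counts out of the 399 items (assumes all items were scored).
n_questions = 399
for model, rate in [("Llama 3 8B", 0.539), ("Mixtral 8x7B", 0.637),
                    ("Llama 3 70B", 0.775), ("GPT-4o", 0.868)]:
    print(f"{model}: ~{round(rate * n_questions)} / {n_questions} correct")
# Llama 3 8B: ~215 / 399 correct
# Mixtral 8x7B: ~254 / 399 correct
# Llama 3 70B: ~309 / 399 correct
# GPT-4o: ~346 / 399 correct
```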
Conclusions: 10 of the 31 LLMs exceeded human-level performance on the Revalida benchmark, while 9 failed to provide coherent answers to the task. Larger models performed better overall, although certain medium-sized LLMs surpassed some of the larger ones.