Benchmarking open-source large language models on Portuguese Revalida multiple-choice questions.

João Victor Bruneti Severino, Pedro Angelo Basei de Paula, Matheus Nespolo Berger, Filipe Silveira Loures, Solano Amadori Todeschini, Eduardo Augusto Roeder, Maria Han Veiga, Murilo Guedes, Gustavo Lenci Marques

BMJ Health & Care Informatics, 32(1), published 2025-02-24. DOI: 10.1136/bmjhci-2024-101195. Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12082654/pdf/
Objective: The study aimed to evaluate the performance of leading large language models (LLMs) on a validated medical knowledge test in Portuguese.
Methods: This study compared 31 LLMs on Revalida, the Brazilian national medical licensing examination, evaluating 23 open-source and 8 proprietary models across 399 multiple-choice questions.
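As a rough illustration of this kind of evaluation, the sketch below scores single-letter answers against a gold key. The `accuracy` function, the answer format, and the handling of incoherent outputs are assumptions for illustration, not the authors' published pipeline.

```python
# Hypothetical sketch (not the authors' code) of how accuracy on
# multiple-choice questions could be scored.

def accuracy(predictions: list[str], answer_key: list[str]) -> float:
    """Fraction of questions where the model's letter matches the key.

    Outputs that are not a single A-E letter (incoherent answers)
    simply count as wrong.
    """
    valid = {"A", "B", "C", "D", "E"}
    correct = 0
    for pred, gold in zip(predictions, answer_key, strict=True):
        letter = pred.strip().upper()
        if letter in valid and letter == gold:
            correct += 1
    return correct / len(answer_key)

# Toy example with a hypothetical 5-question answer key:
key = ["A", "C", "B", "D", "A"]
model_output = ["A", "C", "E", "D", "I am not sure"]  # last reply is incoherent
print(f"Success rate: {accuracy(model_output, key):.1%}")  # -> Success rate: 60.0%
```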
Results: Among the smaller models, Llama 3 8B exhibited the highest success rate at 53.9%, while the medium-sized Mixtral 8×7B attained 63.7%. Larger models such as Llama 3 70B reached 77.5%. Among the proprietary models, GPT-4o and Claude Opus were the most accurate, scoring 86.8% and 83.8%, respectively.
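For scale, the reported rates translate into approximate question counts out of the 399 items; this is a back-of-the-envelope check assuming every question was scored for every model, not a figure from the paper.

```python
# Back-of-the-envelope: convert reported success rates into approximate
# question counts out of the 399 items (assumes all items were scored).
n_questions = 399
for model, rate in [("Llama 3 8B", 0.539), ("Mixtral 8x7B", 0.637),
                    ("Llama 3 70B", 0.775), ("GPT-4o", 0.868)]:
    print(f"{model}: ~{round(rate * n_questions)} / {n_questions} correct")
# Llama 3 8B: ~215 / 399 correct
# Mixtral 8x7B: ~254 / 399 correct
# Llama 3 70B: ~309 / 399 correct
# GPT-4o: ~346 / 399 correct
```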
Conclusions: 10 of the 31 LLMs exceeded human-level performance on the Revalida benchmark, while 9 failed to provide coherent answers to the task. Larger models performed better overall, although certain medium-sized LLMs surpassed some of the larger ones.