MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering

IF 6.1 · Zone 2 (Medicine) · Q1 Computer Science, Artificial Intelligence

Artificial Intelligence in Medicine · Pub Date: 2024-07-31 · DOI: 10.1016/j.artmed.2024.102938

Iñigo Alonso, Maite Oronoz, Rodrigo Agerri

Volume 155, Article 102938 · Publication type: Journal Article · Open access: no · Full text: https://www.sciencedirect.com/science/article/pii/S0933365724001805

Citations: 0

Abstract


MedExpQA: Multilingual benchmarking of Large Language Models for Medical Question Answering

Large Language Models (LLMs) have the potential to facilitate the development of Artificial Intelligence technology that assists medical experts with interactive decision support. This potential has been illustrated by the state-of-the-art performance obtained by LLMs in Medical Question Answering, with striking results such as passing marks in medical licensing exams. However, while impressive, these results remain far from the quality bar required for medical applications. Currently, LLMs remain challenged by outdated knowledge and by their tendency to generate hallucinated content. Furthermore, most benchmarks that assess medical knowledge lack reference gold explanations, which means that it is not possible to evaluate the reasoning behind LLMs' predictions. Finally, the situation is particularly grim for benchmarking LLMs in languages other than English, which remains, as far as we know, a totally neglected topic. To address these shortcomings, in this paper we present MedExpQA, the first multilingual benchmark based on medical exams to evaluate LLMs in Medical Question Answering. To the best of our knowledge, MedExpQA includes for the first time reference gold explanations, written by medical doctors, of the correct and incorrect options in the exams. Comprehensive multilingual experimentation using both the gold reference explanations and Retrieval Augmented Generation (RAG) approaches shows that the performance of LLMs, with best results of around 75% accuracy for English, still has large room for improvement, especially for languages other than English, for which accuracy drops by 10 points. Therefore, despite using state-of-the-art RAG methods, our results also demonstrate the difficulty of obtaining and integrating readily available medical knowledge that may positively impact results on downstream evaluations for Medical Question Answering. Data, code, and fine-tuned models will be made publicly available.
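The abstract reports results as accuracy per language and as a gap in points between English and other languages. The following is a minimal sketch of how such figures could be computed for multiple-choice predictions. It is not the authors' evaluation code: the record fields (`lang`, `predicted`, `gold`), the function name `accuracy_by_language`, and the toy records are all assumptions made for illustration, and the toy numbers are not results from the paper.

```python
from collections import defaultdict


def accuracy_by_language(records):
    """Return {language: accuracy in percent} for multiple-choice predictions.

    Each record is a dict with hypothetical keys:
      "lang"      - language code of the question
      "predicted" - option index chosen by the model
      "gold"      - index of the correct option
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["lang"]] += 1
        if rec["predicted"] == rec["gold"]:
            correct[rec["lang"]] += 1
    return {lang: 100.0 * correct[lang] / total[lang] for lang in total}


# Toy records invented for illustration only (not data from MedExpQA).
toy_records = [
    {"lang": "en", "predicted": 2, "gold": 2},
    {"lang": "en", "predicted": 1, "gold": 1},
    {"lang": "en", "predicted": 4, "gold": 3},
    {"lang": "en", "predicted": 5, "gold": 5},
    {"lang": "en", "predicted": 3, "gold": 3},
    {"lang": "es", "predicted": 2, "gold": 2},
    {"lang": "es", "predicted": 3, "gold": 1},
    {"lang": "es", "predicted": 4, "gold": 3},
    {"lang": "es", "predicted": 5, "gold": 5},
    {"lang": "es", "predicted": 1, "gold": 1},
]

scores = accuracy_by_language(toy_records)
# Gap in accuracy points between English and the other language.
gap_points = scores["en"] - scores["es"]
print(scores, gap_points)
```

A "drop of 10 points" in this sense is an absolute difference between two accuracy percentages, not a relative decrease.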


Source journal

Artificial Intelligence in Medicine · Engineering & Technology: Engineering, Biomedical

CiteScore: 15.00

Self-citation rate: 2.70%

Articles published: 143

Review time: 6.3 months

About the journal: Artificial Intelligence in Medicine publishes original articles from a wide variety of interdisciplinary perspectives concerning the theory and practice of artificial intelligence (AI) in medicine, medically-oriented human biology, and health care. Artificial intelligence in medicine may be characterized as the scientific discipline pertaining to research studies, projects, and applications that aim at supporting decision-based medical tasks through knowledge- and/or data-intensive computer-based solutions that ultimately support and improve the performance of a human care provider.