Can large language models reason about medical questions?

IF 6.7 Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, Ole Winther
{"title":"大型语言模型能否推理医学问题?","authors":"Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, Ole Winther","doi":"10.1016/j.patter.2024.100943","DOIUrl":null,"url":null,"abstract":"Although large language models often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama 2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-US Medical Licensing Examination [USMLE], MedMCQA, and PubMedQA) and multiple prompting scenarios: chain of thought (CoT; think step by step), few shot, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama 2 70B also passed the MedQA-USMLE with 62.5% accuracy.","PeriodicalId":36242,"journal":{"name":"Patterns","volume":"19 1","pages":""},"PeriodicalIF":6.7000,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Can large language models reason about medical questions?\",\"authors\":\"Valentin Liévin, Christoffer Egeberg Hother, Andreas Geert Motzfeldt, Ole Winther\",\"doi\":\"10.1016/j.patter.2024.100943\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Although large language models often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama 2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-US Medical Licensing Examination [USMLE], MedMCQA, and PubMedQA) and multiple prompting scenarios: chain of thought (CoT; think step by step), few shot, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). 
Open-source models are closing the gap: Llama 2 70B also passed the MedQA-USMLE with 62.5% accuracy.\",\"PeriodicalId\":36242,\"journal\":{\"name\":\"Patterns\",\"volume\":\"19 1\",\"pages\":\"\"},\"PeriodicalIF\":6.7000,\"publicationDate\":\"2024-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Patterns\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1016/j.patter.2024.100943\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Patterns","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1016/j.patter.2024.100943","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Although large language models often produce impressive outputs, it remains unclear how they perform in real-world scenarios requiring strong reasoning skills and expert domain knowledge. We set out to investigate whether closed- and open-source models (GPT-3.5, Llama 2, etc.) can be applied to answer and reason about difficult real-world-based questions. We focus on three popular medical benchmarks (MedQA-US Medical Licensing Examination [USMLE], MedMCQA, and PubMedQA) and multiple prompting scenarios: chain of thought (CoT; think step by step), few shot, and retrieval augmentation. Based on an expert annotation of the generated CoTs, we found that InstructGPT can often read, reason, and recall expert knowledge. Last, by leveraging advances in prompt engineering (few-shot and ensemble methods), we demonstrated that GPT-3.5 not only yields calibrated predictive distributions but also reaches the passing score on three datasets: MedQA-USMLE (60.2%), MedMCQA (62.7%), and PubMedQA (78.2%). Open-source models are closing the gap: Llama 2 70B also passed the MedQA-USMLE with 62.5% accuracy.
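The recipe the abstract describes (zero-shot chain-of-thought prompting, sampling several reasoning chains, and ensembling the extracted answers into a predictive distribution) can be sketched in a few lines of Python. The sketch below is only an illustration, not the authors' released pipeline: the `generate` function, the prompt template, and the letter-extraction heuristic are placeholder assumptions to be replaced by a real model call (GPT-3.5, Llama 2, etc.) and the formatting of the specific benchmark.

```python
from __future__ import annotations

import re
from collections import Counter


def generate(prompt: str, temperature: float = 0.8) -> str:
    """Hypothetical LLM completion call; replace with your provider's client."""
    raise NotImplementedError


def build_cot_prompt(question: str, options: dict[str, str]) -> str:
    # Zero-shot chain of thought: present the question and answer options, then
    # add the "think step by step" cue before the model commits to one letter.
    opts = "\n".join(f"{letter}) {text}" for letter, text in options.items())
    return (
        f"Question: {question}\n{opts}\n\n"
        "Answer: Let's think step by step.\n"
    )


def extract_choice(completion: str, options: dict[str, str]) -> str | None:
    # Heuristic answer extraction: take the last option letter the CoT mentions.
    letters = re.findall(r"\b([A-E])\b", completion)
    return letters[-1] if letters and letters[-1] in options else None


def ensemble_answer(question: str, options: dict[str, str], n_samples: int = 5):
    # Self-consistency-style ensembling: sample several CoTs at nonzero
    # temperature and majority-vote over the extracted letters. The vote
    # fractions give a rough predictive distribution over the options.
    prompt = build_cot_prompt(question, options)
    votes: Counter[str] = Counter()
    for _ in range(n_samples):
        choice = extract_choice(generate(prompt, temperature=0.8), options)
        if choice is not None:
            votes[choice] += 1
    total = sum(votes.values()) or 1
    distribution = {letter: votes[letter] / total for letter in options}
    best = max(distribution, key=distribution.get)
    return best, distribution
```

For MedQA- or MedMCQA-style items, `options` maps letters to answer texts; raising `n_samples` trades compute for a smoother vote distribution, in the spirit of the few-shot and ensemble prompting methods the abstract mentions.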
Source journal
Patterns (Decision Sciences: Decision Sciences (all))
CiteScore: 10.60
Self-citation rate: 4.60%
Articles published: 153
Review time: 19 weeks