评估耳鼻喉科领域不同大型语言模型的未知潜在质量和局限性。

IF 16.4 1区 化学 Q1 CHEMISTRY, MULTIDISCIPLINARY
Accounts of Chemical Research Pub Date : 2024-03-01 Epub Date: 2024-05-23 DOI:10.1080/00016489.2024.2352843
Christoph R Buhr, Harry Smith, Tilman Huppertz, Katharina Bahr-Hamm, Christoph Matthias, Clemens Cuny, Jan Phillipp Snijders, Benjamin Philipp Ernst, Andrew Blaikie, Tom Kelsey, Sebastian Kuhn, Jonas Eckrich
{"title":"评估耳鼻喉科领域不同大型语言模型的未知潜在质量和局限性。","authors":"Christoph R Buhr, Harry Smith, Tilman Huppertz, Katharina Bahr-Hamm, Christoph Matthias, Clemens Cuny, Jan Phillipp Snijders, Benjamin Philipp Ernst, Andrew Blaikie, Tom Kelsey, Sebastian Kuhn, Jonas Eckrich","doi":"10.1080/00016489.2024.2352843","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.</p><p><strong>Aims/objectives: </strong>Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).</p><p><strong>Material and methods: </strong>Case-based questions were extracted from literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert-scale for medical adequacy, comprehensibility, coherence, and conciseness. Given answers were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared.</p><p><strong>Results: </strong>LLMs answers ranked inferior to consultants in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among LLMs Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/264) for Bard 2023.07.13, and 6% (71/1230) for consultants.</p><p><strong>Conclusions and significance: </strong>Despite consultants superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on larger scale.</p>","PeriodicalId":1,"journal":{"name":"Accounts of Chemical Research","volume":null,"pages":null},"PeriodicalIF":16.4000,"publicationDate":"2024-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Assessing unknown potential-quality and limitations of different large language models in the field of otorhinolaryngology.\",\"authors\":\"Christoph R Buhr, Harry Smith, Tilman Huppertz, Katharina Bahr-Hamm, Christoph Matthias, Clemens Cuny, Jan Phillipp Snijders, Benjamin Philipp Ernst, Andrew Blaikie, Tom Kelsey, Sebastian Kuhn, Jonas Eckrich\",\"doi\":\"10.1080/00016489.2024.2352843\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Background: </strong>Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.</p><p><strong>Aims/objectives: </strong>Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).</p><p><strong>Material and methods: </strong>Case-based questions were extracted from literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert-scale for medical adequacy, comprehensibility, coherence, and conciseness. Given answers were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared.</p><p><strong>Results: </strong>LLMs answers ranked inferior to consultants in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among LLMs Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/264) for Bard 2023.07.13, and 6% (71/1230) for consultants.</p><p><strong>Conclusions and significance: </strong>Despite consultants superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on larger scale.</p>\",\"PeriodicalId\":1,\"journal\":{\"name\":\"Accounts of Chemical Research\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":16.4000,\"publicationDate\":\"2024-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Accounts of Chemical Research\",\"FirstCategoryId\":\"3\",\"ListUrlMain\":\"https://doi.org/10.1080/00016489.2024.2352843\",\"RegionNum\":1,\"RegionCategory\":\"化学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"2024/5/23 0:00:00\",\"PubModel\":\"Epub\",\"JCR\":\"Q1\",\"JCRName\":\"CHEMISTRY, MULTIDISCIPLINARY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Accounts of Chemical Research","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1080/00016489.2024.2352843","RegionNum":1,"RegionCategory":"化学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/5/23 0:00:00","PubModel":"Epub","JCR":"Q1","JCRName":"CHEMISTRY, MULTIDISCIPLINARY","Score":null,"Total":0}
引用次数: 0

摘要

背景:大型语言模型(LLMs)可以为缺乏训练有素的医疗人员提供解决方案,尤其是在中低收入国家。然而,它们的优缺点仍不明确:在此,我们以耳鼻喉科(ORL)的六名顾问为对象,对不同的语言模型(Bard 2023.07.13、Claude 2、ChatGPT 4)进行了基准测试:从文献和德国国家考试中提取了基于案例的问题。对 Bard 2023.07.13、Claude 2、ChatGPT 4 和六位耳鼻喉科顾问的答案进行了盲评,采用李克特(Likert)6 点量表,对医学充分性、可理解性、连贯性和简洁性进行评分。给出的答案与经过验证的答案进行了比较,并对危险性进行了评估。进行了修改后的图灵测试,并对字符数进行了比较:结果:在所有类别中,法律硕士的答案都不如顾问。然而,顾问和法律硕士之间的差距微乎其微,在简洁性方面差距最明显,而在可理解性方面差距最小。在法律硕士中,克劳德 2 在医学充分性和简洁性方面被评为最佳。顾问的答案有 93%(228/246)与验证方案相符,ChatGPT 4 有 85%(35/41),Claude 2 有 78%(32/41),Bard 2023.07.13 有 59%(24/41)。在 ChatGPT 4 中,10%(24/246)的答案被评为潜在危险;在 Claude 2 中,14%(34/246)的答案被评为潜在危险;在 Bard 2023.07.13 中,19%(46/264)的答案被评为潜在危险;在顾问中,6%(71/1230)的答案被评为潜在危险:尽管咨询师的性能更优越,但 LLM 在 ORL 的临床应用中仍有潜力。未来的研究应更大规模地评估其性能。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Assessing unknown potential-quality and limitations of different large language models in the field of otorhinolaryngology.

Background: Large Language Models (LLMs) might offer a solution for the lack of trained health personnel, particularly in low- and middle-income countries. However, their strengths and weaknesses remain unclear.

Aims/objectives: Here we benchmark different LLMs (Bard 2023.07.13, Claude 2, ChatGPT 4) against six consultants in otorhinolaryngology (ORL).

Material and methods: Case-based questions were extracted from literature and German state examinations. Answers from Bard 2023.07.13, Claude 2, ChatGPT 4, and six ORL consultants were rated blindly on a 6-point Likert-scale for medical adequacy, comprehensibility, coherence, and conciseness. Given answers were compared to validated answers and evaluated for hazards. A modified Turing test was performed and character counts were compared.

Results: LLMs answers ranked inferior to consultants in all categories. Yet, the difference between consultants and LLMs was marginal, with the clearest disparity in conciseness and the smallest in comprehensibility. Among LLMs Claude 2 was rated best in medical adequacy and conciseness. Consultants' answers matched the validated solution in 93% (228/246), ChatGPT 4 in 85% (35/41), Claude 2 in 78% (32/41), and Bard 2023.07.13 in 59% (24/41). Answers were rated as potentially hazardous in 10% (24/246) for ChatGPT 4, 14% (34/246) for Claude 2, 19% (46/264) for Bard 2023.07.13, and 6% (71/1230) for consultants.

Conclusions and significance: Despite consultants superior performance, LLMs show potential for clinical application in ORL. Future studies should assess their performance on larger scale.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
Accounts of Chemical Research
Accounts of Chemical Research 化学-化学综合
CiteScore
31.40
自引率
1.10%
发文量
312
审稿时长
2 months
期刊介绍: Accounts of Chemical Research presents short, concise and critical articles offering easy-to-read overviews of basic research and applications in all areas of chemistry and biochemistry. These short reviews focus on research from the author’s own laboratory and are designed to teach the reader about a research project. In addition, Accounts of Chemical Research publishes commentaries that give an informed opinion on a current research problem. Special Issues online are devoted to a single topic of unusual activity and significance. Accounts of Chemical Research replaces the traditional article abstract with an article "Conspectus." These entries synopsize the research affording the reader a closer look at the content and significance of an article. Through this provision of a more detailed description of the article contents, the Conspectus enhances the article's discoverability by search engines and the exposure for the research.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信