Kevin Kaufmes, Georg Mathes, Dilyana Vladimirova, Stephanie Berger, Christian Fegeler, Stefan Sigle
Studies in Health Technology and Informatics, vol. 331, pp. 81–90. Published 2025-09-03. DOI: 10.3233/SHTI251382
Evaluating Medium Scale, Open-Source Large Language Models: Towards Decision Support in a Precision Oncology Care Delivery Context.
Introduction: In precision oncology, patients often present with complex conditions whose treatment depends on specific, up-to-date knowledge of guidelines and research. Preparing such cases for molecular tumor boards (MTBs) therefore requires considerable effort. Large language models (LLMs) could lower this burden if they could provide such information quickly and precisely on demand. Since out-of-the-box LLMs are not specialized for clinical contexts, this work investigates their usefulness for answering questions that arise during MTB preparation. Because such questions can contain sensitive data, we evaluated medium-scale models suitable for running on-premises on consumer-grade hardware.
Methods: Three recent LLMs were selected for testing based on established benchmarks and distinctive characteristics such as reasoning capability. Example questions related to MTBs were collected from domain experts, and six of these were selected for the LLMs to answer. Response quality and correctness were evaluated by experts using a questionnaire.
Results: Of the 60 domain experts contacted, 5 completed the survey fully and another 5 completed it partially. The evaluation revealed modest overall performance. A large percentage of answers contained outdated or incomplete information, as well as factual errors. Additionally, we observed high discordance between evaluators regarding correctness, along with varying rater confidence.
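Inter-rater discordance of the kind reported here is often summarized as mean pairwise percent agreement. The sketch below is a hypothetical illustration of that metric, not the paper's actual analysis; the rating labels and the use of `None` for skipped items (mirroring partial survey completion) are assumptions.

```python
from itertools import combinations

def pairwise_agreement(ratings):
    """Mean pairwise percent agreement across raters.

    `ratings` is a list of per-rater label lists, one label per answer;
    a rater who skipped an item records None, so partially completed
    surveys still contribute their overlapping items.
    """
    agreements = []
    for a, b in combinations(range(len(ratings)), 2):
        # Keep only items both raters actually labeled.
        shared = [(x, y) for x, y in zip(ratings[a], ratings[b])
                  if x is not None and y is not None]
        if shared:
            agreements.append(sum(x == y for x, y in shared) / len(shared))
    return sum(agreements) / len(agreements) if agreements else 0.0

# Two raters, four jointly rated items, agreement on two of them -> 0.5
print(pairwise_agreement([
    ["correct", "outdated", None, "correct", "error"],
    ["correct", "correct", "error", "error", "error"],
]))
```

Chance-corrected statistics such as Cohen's kappa are usually preferred for formal reporting; plain percent agreement overstates concordance when one label dominates.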
Conclusion: Our results suggest that medium-scale LLMs are currently insufficiently reliable for use in precision oncology. Common issues include outdated information and the confident presentation of misinformation, indicating a gap between benchmark and real-world performance. Future research should focus on mitigating these limitations with techniques such as retrieval-augmented generation (RAG), web search capability, or advanced prompting, while prioritizing patient safety.
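The RAG direction proposed in the conclusion amounts to grounding the model's answer in retrieved, current guideline text rather than its training data. A minimal sketch of that pattern, using a toy keyword-overlap retriever and placeholder guideline snippets (both assumptions, not the paper's implementation; a production system would use embedding-based retrieval over real guideline corpora):

```python
def retrieve(query, docs, k=2):
    """Rank guideline snippets by naive keyword overlap with the query."""
    q = set(query.lower().split())
    return sorted(docs,
                  key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(question, docs):
    """Assemble a grounded prompt: retrieved context first, question last."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return ("Answer using only the context below. "
            "If the context is insufficient, say so.\n"
            f"Context:\n{context}\n"
            f"Question: {question}")

# Toy corpus of placeholder snippets (hypothetical, for illustration only).
docs = [
    "Guideline A covers BRAF variants and recommended follow-up testing",
    "Guideline B covers ERBB2 amplification in solid tumors",
]
print(build_prompt("What does the current guideline say about BRAF", docs))
```

Instructing the model to refuse when context is insufficient is the part that directly targets the "confident misinformation" failure mode observed in the evaluation.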