Kevin Kaufmes, Georg Mathes, Dilyana Vladimirova, Stephanie Berger, Christian Fegeler, Stefan Sigle
Studies in Health Technology and Informatics, vol. 331, pp. 81–90. Published 2025-09-03. DOI: 10.3233/SHTI251382
Evaluating Medium Scale, Open-Source Large Language Models: Towards Decision Support in a Precision Oncology Care Delivery Context.
Introduction: In precision oncology, patients often present with complex conditions whose treatment depends on specific, up-to-date knowledge of guidelines and research. Preparing such cases for molecular tumor boards (MTBs) therefore requires considerable effort. Large language models (LLMs) could lower this burden if they could provide such information quickly and precisely on demand. Since out-of-the-box LLMs are not specialized for clinical contexts, this work investigates their usefulness for answering questions that arise during MTB preparation. Because such questions can contain sensitive data, we evaluated medium-scale models suitable for running on-premises on consumer-grade hardware.
Methods: Three recent LLMs were selected for testing based on established benchmarks and distinctive characteristics such as reasoning capability. Example questions related to MTBs were collected from domain experts, and six of these were selected for the LLMs to answer. Response quality and correctness were evaluated by experts using a questionnaire.
Results: Of the 60 domain experts contacted, 5 completed the survey fully and another 5 completed it partially. The evaluation revealed modest overall performance. A large percentage of answers contained outdated or incomplete information, as well as factual errors. Additionally, we observed high discordance between evaluators regarding correctness, along with varying rater confidence.
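Inter-rater discordance of the kind reported here is often summarized as mean pairwise percent agreement. The sketch below is a hypothetical illustration of that metric, not the paper's actual analysis; the rating labels and the use of `None` for skipped items (mirroring partial survey completion) are assumptions.

```python
from itertools import combinations

def pairwise_agreement(ratings):
    """Mean pairwise percent agreement across raters.

    `ratings` is a list of per-rater label lists, one label per answer;
    a rater who skipped an item records None, so partially completed
    surveys still contribute their overlapping items.
    """
    agreements = []
    for a, b in combinations(range(len(ratings)), 2):
        # Keep only items both raters actually labeled.
        shared = [(x, y) for x, y in zip(ratings[a], ratings[b])
                  if x is not None and y is not None]
        if shared:
            agreements.append(sum(x == y for x, y in shared) / len(shared))
    return sum(agreements) / len(agreements) if agreements else 0.0

# Two raters, four jointly rated items, agreement on two of them -> 0.5
print(pairwise_agreement([
    ["correct", "outdated", None, "correct", "error"],
    ["correct", "correct", "error", "error", "error"],
]))
```

Chance-corrected statistics such as Cohen's kappa are usually preferred for formal reporting; plain percent agreement overstates concordance when one label dominates.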
Conclusion: Our results suggest that medium-scale LLMs are currently insufficiently reliable for use in precision oncology. Common issues include outdated information and the confident presentation of misinformation, indicating a gap between benchmark and real-world performance. Future research should focus on mitigating these limitations with techniques such as retrieval-augmented generation (RAG), web search capability, or advanced prompting, while prioritizing patient safety.
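The RAG direction proposed in the conclusion amounts to grounding the model's answer in retrieved, current guideline text rather than its training data. A minimal sketch of that pattern, using a toy keyword-overlap retriever and placeholder guideline snippets (both assumptions, not the paper's implementation; a production system would use embedding-based retrieval over real guideline corpora):

```python
def retrieve(query, docs, k=2):
    """Rank guideline snippets by naive keyword overlap with the query."""
    q = set(query.lower().split())
    return sorted(docs,
                  key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

def build_prompt(question, docs):
    """Assemble a grounded prompt: retrieved context first, question last."""
    context = "\n".join(f"- {d}" for d in retrieve(question, docs))
    return ("Answer using only the context below. "
            "If the context is insufficient, say so.\n"
            f"Context:\n{context}\n"
            f"Question: {question}")

# Toy corpus of placeholder snippets (hypothetical, for illustration only).
docs = [
    "Guideline A covers BRAF variants and recommended follow-up testing",
    "Guideline B covers ERBB2 amplification in solid tumors",
]
print(build_prompt("What does the current guideline say about BRAF", docs))
```

Instructing the model to refuse when context is insufficient is the part that directly targets the "confident misinformation" failure mode observed in the evaluation.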