Use of a large language model with instruction-tuning for reliable clinical frailty scoring.

Xiang Lee Jamie Kee, Gerald Gui Ren Sng, Daniel Yan Zheng Lim, Joshua Yi Min Tung, Hairil Rizal Abdullah, Anupama Roy Chowdury
{"title":"Use of a large language model with instruction-tuning for reliable clinical frailty scoring.","authors":"Xiang Lee Jamie Kee, Gerald Gui Ren Sng, Daniel Yan Zheng Lim, Joshua Yi Min Tung, Hairil Rizal Abdullah, Anupama Roy Chowdury","doi":"10.1111/jgs.19114","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Frailty is an important predictor of health outcomes, characterized by increased vulnerability due to physiological decline. The Clinical Frailty Scale (CFS) is commonly used for frailty assessment but may be influenced by rater bias. Use of artificial intelligence (AI), particularly Large Language Models (LLMs) offers a promising method for efficient and reliable frailty scoring.</p><p><strong>Methods: </strong>The study utilized seven standardized patient scenarios to evaluate the consistency and reliability of CFS scoring by OpenAI's GPT-3.5-turbo model. Two methods were tested: a basic prompt and an instruction-tuned prompt incorporating CFS definition, a directive for accurate responses, and temperature control. The outputs were compared using the Mann-Whitney U test and Fleiss' Kappa for inter-rater reliability. The outputs were compared with historic human scores of the same scenarios.</p><p><strong>Results: </strong>The LLM's median scores were similar to human raters, with differences of no more than one point. Significant differences in score distributions were observed between the basic and instruction-tuned prompts in five out of seven scenarios. The instruction-tuned prompt showed high inter-rater reliability (Fleiss' Kappa of 0.887) and produced consistent responses in all scenarios. Difficulty in scoring was noted in scenarios with less explicit information on activities of daily living (ADLs).</p><p><strong>Conclusions: </strong>This study demonstrates the potential of LLMs in consistently scoring clinical frailty with high reliability. 
It demonstrates that prompt engineering via instruction-tuning can be a simple but effective approach for optimizing LLMs in healthcare applications. The LLM may overestimate frailty scores when less information about ADLs is provided, possibly as it is less subject to implicit assumptions and extrapolation than humans. Future research could explore the integration of LLMs in clinical research and frailty-related outcome prediction.</p>","PeriodicalId":94112,"journal":{"name":"Journal of the American Geriatrics Society","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of the American Geriatrics Society","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1111/jgs.19114","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Frailty is an important predictor of health outcomes, characterized by increased vulnerability due to physiological decline. The Clinical Frailty Scale (CFS) is commonly used for frailty assessment but may be influenced by rater bias. Use of artificial intelligence (AI), particularly Large Language Models (LLMs), offers a promising method for efficient and reliable frailty scoring.

Methods: The study utilized seven standardized patient scenarios to evaluate the consistency and reliability of CFS scoring by OpenAI's GPT-3.5-turbo model. Two methods were tested: a basic prompt and an instruction-tuned prompt incorporating the CFS definition, a directive for accurate responses, and temperature control. The outputs were compared using the Mann-Whitney U test, and inter-rater reliability was assessed with Fleiss' Kappa. The outputs were also compared with historical human scores of the same scenarios.
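The two prompting conditions can be sketched as follows. This is an illustrative reconstruction only: the abstract states that the instruction-tuned prompt contained the CFS definition, an accuracy directive, and temperature control, but the exact wording, the full CFS rubric text, and the function names below are assumptions, not the authors' actual prompts.

```python
# Illustrative sketch of the two prompting strategies compared in the study.
# The CFS definition here is abbreviated; the actual prompt would carry the
# full level-by-level rubric.
CFS_DEFINITION = (
    "Clinical Frailty Scale (CFS): scores range from 1 (very fit) "
    "to 9 (terminally ill), with intermediate levels describing "
    "increasing dependence in activities of daily living."
)

def basic_messages(scenario: str) -> list[dict]:
    # Basic prompt: simply ask for a CFS score for the vignette.
    return [{
        "role": "user",
        "content": f"Assign a Clinical Frailty Scale score to this patient:\n{scenario}",
    }]

def instruction_tuned_messages(scenario: str) -> list[dict]:
    # Instruction-tuned prompt: embed the CFS definition plus a directive
    # to score accurately and answer with a single integer.
    system = (
        f"{CFS_DEFINITION}\n"
        "Score strictly according to the definitions above. "
        "Reply with a single integer from 1 to 9."
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": scenario},
    ]

# With the OpenAI client, the instruction-tuned condition would additionally
# fix the sampling temperature to reduce output variability, e.g.:
#   client.chat.completions.create(model="gpt-3.5-turbo", temperature=0,
#                                  messages=instruction_tuned_messages(scenario))
```

Fixing the temperature makes the model's sampling more deterministic, which is one plausible contributor to the higher consistency reported for the instruction-tuned condition.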

Results: The LLM's median scores were similar to those of human raters, with differences of no more than one point. Significant differences in score distributions were observed between the basic and instruction-tuned prompts in five out of seven scenarios. The instruction-tuned prompt showed high inter-rater reliability (Fleiss' Kappa of 0.887) and produced consistent responses in all scenarios. Difficulty in scoring was noted in scenarios with less explicit information on activities of daily living (ADLs).
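Fleiss' Kappa, the agreement statistic reported above, measures how much the observed per-subject agreement among raters exceeds the agreement expected by chance from the category marginals. A minimal pure-Python sketch with toy counts (not the study's data) is:

```python
def fleiss_kappa(ratings: list[list[int]]) -> float:
    """Fleiss' kappa for a count table where ratings[i][j] is the number
    of raters assigning subject i to category j; every row must sum to
    the same number of raters n."""
    N = len(ratings)          # number of subjects
    n = sum(ratings[0])       # raters per subject
    k = len(ratings[0])       # number of categories
    # Observed agreement for each subject, then averaged.
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in ratings]
    P_bar = sum(P_i) / N
    # Chance agreement from the category marginal proportions.
    p_j = [sum(row[j] for row in ratings) / (N * n) for j in range(k)]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Three subjects, three raters, two categories, perfect agreement -> kappa = 1.
print(fleiss_kappa([[3, 0], [0, 3], [3, 0]]))  # 1.0
```

A kappa near 0.887, as reported for the instruction-tuned prompt, indicates agreement far above chance; values above roughly 0.8 are conventionally read as almost perfect agreement.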

Conclusions: This study demonstrates the potential of LLMs to score clinical frailty consistently and with high reliability. It shows that prompt engineering via instruction-tuning can be a simple but effective approach for optimizing LLMs in healthcare applications. The LLM may overestimate frailty scores when less information about ADLs is provided, possibly because it is less subject to implicit assumptions and extrapolation than human raters are. Future research could explore the integration of LLMs in clinical research and frailty-related outcome prediction.
