Generative Large Language Model-Powered Conversational AI App for Personalized Risk Assessment: Case Study in COVID-19.

JMIR AI · Published 2025-03-27 · DOI: 10.2196/67363
Mohammad Amin Roshani, Xiangyu Zhou, Yao Qiang, Srinivasan Suresh, Steven Hicks, Usha Sethuraman, Dongxiao Zhu
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11986386/pdf/

Abstract

Background: Large language models (LLMs) have demonstrated powerful capabilities in natural language tasks and are increasingly being integrated into health care for tasks like disease risk assessment. Traditional machine learning methods rely on structured data and coding, limiting their flexibility in dynamic clinical environments. This study presents a novel approach to disease risk assessment using generative LLMs through conversational artificial intelligence (AI), eliminating the need for programming.

Objective: This study evaluates the use of pretrained generative LLMs, including LLaMA2-7b and Flan-T5-xl, for COVID-19 severity prediction, with the goal of enabling a real-time, no-code risk assessment solution through chatbot-based question-answering interactions. To contextualize their performance, we compare the LLMs with traditional machine learning classifiers that rely on tabular data, such as logistic regression, extreme gradient boosting (XGBoost), and random forest.

Methods: We fine-tuned LLMs using few-shot natural language examples from a dataset of 393 pediatric patients and developed a mobile app that integrates these models to provide real-time, no-code COVID-19 severity risk assessment through clinician-patient interaction. The LLMs were compared with traditional classifiers across different experimental settings, using the area under the curve (AUC) as the primary evaluation metric. Feature importance derived from LLM attention layers was also analyzed to enhance interpretability.
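The few-shot setup described above depends on serializing each patient's tabular record into a natural-language example. A minimal sketch of such serialization, assuming a question-answering prompt format (the feature names, template wording, and labels here are illustrative assumptions, not the study's actual prompt design):

```python
# Sketch: render a tabular pediatric record as a natural-language
# question-answering prompt for few-shot LLM risk assessment.
# Feature names and template wording are illustrative assumptions.

def serialize_record(record: dict) -> str:
    """Render one patient's features as a natural-language sentence."""
    parts = [f"{name.replace('_', ' ')} is {value}"
             for name, value in record.items()]
    return "The patient's " + "; ".join(parts) + "."

def build_prompt(examples: list, query: dict) -> str:
    """Assemble a few-shot prompt: labeled examples, then the query."""
    lines = []
    for record, label in examples:
        lines.append(serialize_record(record))
        lines.append(f"Question: Is this COVID-19 case severe? Answer: {label}")
    lines.append(serialize_record(query))
    lines.append("Question: Is this COVID-19 case severe? Answer:")
    return "\n".join(lines)

# Two labeled demonstrations (a 2-shot setting), then an unlabeled query.
examples = [({"age": 4, "oxygen_saturation": "91%"}, "yes"),
            ({"age": 9, "oxygen_saturation": "98%"}, "no")]
query = {"age": 6, "oxygen_saturation": "93%"}
prompt = build_prompt(examples, query)
```

The generative model then completes the final "Answer:" slot, so severity prediction becomes a text-generation task requiring no feature engineering or model-specific code from the clinician.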

Results: Generative LLMs demonstrated strong performance in low-data settings. In zero-shot scenarios, the T0-3b-T model achieved an AUC of 0.75, while other LLMs, such as T0pp(8bit)-T and Flan-T5-xl-T, reached 0.67 and 0.69, respectively. At 2-shot settings, logistic regression and random forest achieved an AUC of 0.57, while Flan-T5-xl-T and T0-3b-T obtained 0.69 and 0.65, respectively. By 32-shot settings, Flan-T5-xl-T reached 0.70, similar to logistic regression (0.69) and random forest (0.68), while XGBoost improved to 0.65. These results illustrate the differences in how generative LLMs and traditional models handle increasing data availability: LLMs perform well in low-data scenarios, whereas traditional models rely more on structured tabular data and labeled training examples. Furthermore, the mobile app provides real-time COVID-19 severity assessments and personalized insights through attention-based feature importance, adding value to the clinical interpretation of the results.
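The attention-based feature importance behind the app's personalized insights can be approximated by aggregating attention mass over the tokens that belong to each input feature. A toy sketch under stated assumptions (the attention weights and the token-to-feature mapping below are invented for illustration; extracting and pooling weights from an actual LLM's attention layers involves additional choices such as which layer and head to use):

```python
# Sketch: aggregate token-level attention weights into per-feature
# importance scores. The weights and token-to-feature map are toy
# values for illustration, not outputs of an actual LLM.

def feature_importance(attn_weights, token_features):
    """Sum attention mass per feature, then normalize to sum to 1."""
    totals = {}
    for weight, feature in zip(attn_weights, token_features):
        if feature is not None:  # skip template/punctuation tokens
            totals[feature] = totals.get(feature, 0.0) + weight
    total_mass = sum(totals.values())
    return {f: w / total_mass for f, w in totals.items()}

# Toy attention distribution over the tokens of one serialized record:
# the first token is template text, the rest belong to two features.
attn = [0.05, 0.30, 0.10, 0.40, 0.15]
feats = [None, "age", "age", "oxygen_saturation", "oxygen_saturation"]
importance = feature_importance(attn, feats)
```

Normalized per-feature scores like these are what allow the app to tell a clinician which inputs drove a given severity prediction.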

Conclusions: Generative LLMs provide a robust alternative to traditional classifiers, particularly in scenarios with limited labeled data. Their ability to handle unstructured inputs and deliver personalized, real-time assessments without coding makes them highly adaptable to clinical settings. This study underscores the potential of LLM-powered conversational AI in health care and encourages further exploration of its use for real-time disease risk assessment and decision-making support.
