Clinical Risk Computation by Large Language Models Using Validated Risk Scores.

IF 5.7 | CAS Zone 3 (Medicine) | JCR Q1, Health Care Sciences & Services

Journal of Medical Systems Pub Date : 2025-09-30 DOI:10.1007/s10916-025-02261-5

Kaan Kara, Tuba Gunel

Journal of Medical Systems, volume 49, issue 1, article 121.

Citations: 0

Abstract

Recent advances in artificial intelligence have improved the natural language understanding of Large Language Models (LLMs), enabling new healthcare applications. While LLMs can analyze health data, directly predicting patient risk scores can be unreliable due to inaccuracies, biases, and difficulty interpreting complex medical data. A more trustworthy approach uses LLMs to calculate traditional clinical risk scores: validated, evidence-based formulas widely accepted in medicine. This improves validity, transparency, and safety by relying on established scoring systems rather than LLM-generated risk assessments, while still allowing LLMs to enhance clinical workflows through clear and interpretable explanations. In this study, we evaluated three public LLMs (GPT-4o-mini, DeepSeek v3, and Google Gemini 2.5 Flash) in calculating five clinical risk scores: CHA₂DS₂-VASc, HAS-BLED, Wells Score, Charlson Comorbidity Index, and Framingham Risk Score. We created 100 patient profiles (20 per score) representing diverse clinical scenarios and converted them into natural language clinical notes. These served as prompts for the LLMs to extract information and compute risk scores. We compared LLM-generated scores to reference scores from validated formulas using accuracy, precision, recall, F1 score, and Pearson correlation. GPT-4o-mini and Gemini 2.5 Flash outperformed DeepSeek v3, showing near-perfect agreement on most scores. However, all models struggled with the complex Framingham Risk Score, indicating challenges for general LLMs in complex risk calculations.
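To make the evaluation setup concrete, the following is a minimal sketch of how a reference score and two of the agreement metrics could be computed. It is not the authors' code: the `Patient` fields, the function names, and the sample LLM scores are hypothetical and chosen only for illustration. The CHA₂DS₂-VASc point values follow the standard published rule (1 point each for congestive heart failure, hypertension, diabetes, vascular disease, age 65 to 74, and female sex; 2 points each for age 75 or older and prior stroke/TIA/thromboembolism).

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Patient:
    # Hypothetical structured profile; the study's real schema is not given.
    age: int
    female: bool
    chf: bool
    hypertension: bool
    diabetes: bool
    stroke_tia: bool
    vascular_disease: bool


def cha2ds2_vasc(p: Patient) -> int:
    """Reference CHA2DS2-VASc score from the standard point table."""
    score = 0
    score += 1 if p.chf else 0
    score += 1 if p.hypertension else 0
    score += 1 if p.diabetes else 0
    score += 2 if p.stroke_tia else 0
    score += 1 if p.vascular_disease else 0
    score += 1 if p.female else 0
    if p.age >= 75:
        score += 2
    elif p.age >= 65:
        score += 1
    return score


def accuracy(reference: list[int], predicted: list[int]) -> float:
    """Fraction of cases where the predicted score matches the reference exactly."""
    matches = sum(r == q for r, q in zip(reference, predicted))
    return matches / len(reference)


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation coefficient; assumes neither series is constant."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sd_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    sd_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (sd_x * sd_y)


patients = [
    Patient(80, True, False, True, True, False, False),
    Patient(50, False, False, False, False, False, False),
    Patient(70, False, False, False, False, True, False),
]
reference_scores = [cha2ds2_vasc(p) for p in patients]

# Placeholder values standing in for scores parsed from LLM responses;
# these are invented for the example, not results from the study.
llm_scores = [5, 0, 2]

print(reference_scores)
print(accuracy(reference_scores, llm_scores))
print(pearson(reference_scores, llm_scores))
```

Exact-match accuracy treats a score that is off by one the same as a score that is off by five, whereas Pearson correlation only measures whether the two series move together, which is one reason to report both.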


Source journal

Journal of Medical Systems (Medicine: Health Care)

CiteScore

11.60

Self-citation rate

1.90%

Articles published

Review time

4.8 months

Journal description: Journal of Medical Systems provides a forum for the presentation and discussion of the increasingly extensive applications of new systems techniques and methods in hospital, clinic, and physician's office administration; pathology, radiology, and pharmaceutical delivery systems; medical records storage and retrieval; and ancillary patient-support systems. The journal publishes informative articles, essays, and studies across the entire scale of medical systems, from large hospital programs to novel small-scale medical services. Education is an integral part of this amalgamation of sciences, and selected articles are published in this area. Since existing medical systems are constantly being modified to fit particular circumstances and to solve specific problems, the journal includes a special section devoted to status reports on current installations.