Title: Clinical Risk Computation by Large Language Models Using Validated Risk Scores
Authors: Kaan Kara, Tuba Gunel
Journal: Journal of Medical Systems 49(1), 121 (Journal Article)
Publication date: 2025-09-30
DOI: 10.1007/s10916-025-02261-5 (https://doi.org/10.1007/s10916-025-02261-5)
Impact factor: 5.7; JCR: Q1 (Health Care Sciences & Services); Category: Medicine, Region 3
Open access: No
Citation count: 0
Abstract
Recent progress in artificial intelligence has strengthened the natural language understanding of Large Language Models (LLMs), enabling new healthcare applications. While LLMs can analyze health data, using them to predict patient risk scores directly can be unreliable due to inaccuracies, biases, and difficulty interpreting complex medical data. A more trustworthy approach uses LLMs to calculate traditional clinical risk scores, which are validated, evidence-based formulas widely accepted in medicine. This improves validity, transparency, and safety by relying on established scoring systems rather than LLM-generated risk assessments, while still allowing LLMs to enhance clinical workflows through clear and interpretable explanations. In this study, we evaluated three public LLMs (GPT-4o-mini, DeepSeek v3, and Google Gemini 2.5 Flash) in calculating five clinical risk scores: CHA₂DS₂-VASc, HAS-BLED, Wells Score, Charlson Comorbidity Index, and Framingham Risk Score. We created 100 patient profiles (20 per score) representing diverse clinical scenarios and converted them into natural language clinical notes. These notes served as prompts from which the LLMs extracted information and computed risk scores. We compared LLM-generated scores to reference scores from validated formulas using accuracy, precision, recall, F1 score, and Pearson correlation. GPT-4o-mini and Gemini 2.5 Flash outperformed DeepSeek v3, showing near-perfect agreement on most scores. However, all models struggled with the complex Framingham Risk Score, indicating challenges for general LLMs in complex risk calculations.
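The comparison described in the abstract (a reference score from a validated formula, checked against a model-produced score) can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the `Patient` fields, the function names, and the `llm_scores` values are hypothetical and chosen only for illustration. The point weights follow the standard published CHA₂DS₂-VASc definition, and only two of the five reported metrics (accuracy and Pearson correlation) are shown.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class Patient:
    """Hypothetical structured profile; field names are illustrative."""
    age: int
    female: bool
    chf: bool = False               # congestive heart failure
    hypertension: bool = False
    diabetes: bool = False
    stroke_or_tia: bool = False     # prior stroke, TIA, or thromboembolism
    vascular_disease: bool = False


def cha2ds2_vasc(p: Patient) -> int:
    """Reference CHA2DS2-VASc score using the standard point weights."""
    score = 0
    score += 1 if p.chf else 0
    score += 1 if p.hypertension else 0
    score += 1 if p.diabetes else 0
    score += 2 if p.stroke_or_tia else 0
    score += 1 if p.vascular_disease else 0
    score += 1 if p.female else 0
    if p.age >= 75:
        score += 2
    elif p.age >= 65:
        score += 1
    return score


def accuracy(reference: list[int], predicted: list[int]) -> float:
    """Fraction of cases where the predicted score equals the reference."""
    matches = sum(r == q for r, q in zip(reference, predicted))
    return matches / len(reference)


def pearson(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Illustrative profiles (not data from the study).
patients = [
    Patient(78, True, hypertension=True, stroke_or_tia=True),
    Patient(50, False),
    Patient(68, False, chf=True, diabetes=True),
]
reference = [cha2ds2_vasc(p) for p in patients]

# Hypothetical scores, standing in for values parsed from LLM responses.
llm_scores = [6, 0, 4]

print("accuracy:", accuracy(reference, llm_scores))
print("pearson:", pearson(reference, llm_scores))
```

Keeping the formula in deterministic code and using the LLM only to produce a score from the note makes every disagreement traceable to a specific profile, which is the kind of check the evaluation relies on.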
About the journal:
Journal of Medical Systems provides a forum for the presentation and discussion of the increasingly extensive applications of new systems techniques and methods in hospital, clinic, and physician's office administration; pathology, radiology, and pharmaceutical delivery systems; medical records storage and retrieval; and ancillary patient-support systems. The journal publishes informative articles, essays, and studies across the entire scale of medical systems, from large hospital programs to novel small-scale medical services. Education is an integral part of this amalgamation of sciences, and selected articles are published in this area. Since existing medical systems are constantly being modified to fit particular circumstances and to solve specific problems, the journal includes a special section devoted to status reports on current installations.