{"title":"Reasoning language models for more transparent prediction of suicide risk.","authors":"Thomas H McCoy,Roy H Perlis","doi":"10.1136/bmjment-2025-301654","DOIUrl":null,"url":null,"abstract":"BACKGROUND\r\nWe previously demonstrated that a large language model could estimate suicide risk using hospital discharge notes.\r\n\r\nOBJECTIVE\r\nWith the emergence of reasoning models that can be run on consumer-grade hardware, we investigated whether these models can approximate the performance of much larger and costlier models.\r\n\r\nMETHODS\r\nFrom 458 053 adults hospitalised at one of two academic medical centres between 4 January 2005 and 2 January 2014, we identified 1995 who died by suicide or accident, and matched them with 5 control individuals. We used Llama-DeepSeek-R1 8B to generate predictions of risk. Beyond discrimination and calibration, we examined the aspects of model reasoning-that is, the topics in the chain of thought-associated with correct or incorrect predictions.\r\n\r\nFINDINGS\r\nThe cohort included 1995 individuals who died by suicide or accidental death and 9975 individuals matched 5:1, totalling 11 954 discharges and 58 933 person-years of follow-up. In Fine and Grey regression, hazard as estimated by the Llama3-distilled model was significantly associated with observed risk (unadjusted HR 4.65 (3.58-6.04)). The corresponding c-statistic was 0.64 (0.63-0.65), modestly poorer than the GPT4o model (0.67 (0.66-0.68)). In chain-of-thought reasoning, topics including Substance Abuse, Surgical Procedure, and Age-related Comorbidities were associated with correct predictions, while Fall-related Injury was associated with incorrect prediction.\r\n\r\nCONCLUSIONS\r\nApplication of a reasoning model using local, consumer-grade hardware only modestly diminished performance in stratifying suicide risk.\r\n\r\nCLINICAL IMPLICATIONS\r\nSmaller models can yield more secure, scalable and transparent risk prediction.","PeriodicalId":72434,"journal":{"name":"BMJ mental health","volume":"28 1","pages":""},"PeriodicalIF":4.9000,"publicationDate":"2025-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMJ mental health","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1136/bmjment-2025-301654","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"0","JCRName":"PSYCHIATRY","Score":null,"Total":0}
Abstract
BACKGROUND
We previously demonstrated that a large language model could estimate suicide risk using hospital discharge notes.
OBJECTIVE
With the emergence of reasoning models that can be run on consumer-grade hardware, we investigated whether these models can approximate the performance of much larger and costlier models.
METHODS
From 458 053 adults hospitalised at one of two academic medical centres between 4 January 2005 and 2 January 2014, we identified 1995 who died by suicide or accident and matched each with 5 control individuals. We used Llama-DeepSeek-R1 8B to generate predictions of risk. Beyond discrimination and calibration, we examined aspects of model reasoning, that is, the topics in the chain of thought, that were associated with correct or incorrect predictions.
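As a rough illustration only (not the authors' actual pipeline), the sketch below shows how a locally run DeepSeek-R1 distilled Llama 8B checkpoint might be prompted to produce both a chain of thought and a parsed risk score from a discharge note. The checkpoint name, prompt wording and score format are assumptions introduced for this example.

```python
# Minimal sketch, assuming a Hugging Face checkpoint for the distilled model;
# the prompt and the "RISK: <number>" output convention are illustrative, not
# taken from the paper.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def estimate_risk(discharge_note: str) -> tuple[str, float | None]:
    """Return the model's generated reasoning and a parsed 0-100 risk score."""
    prompt = (
        "You are a clinical risk assessor. Read the discharge note and, "
        "after reasoning step by step, give a suicide/accident risk score "
        "from 0 to 100 on a final line formatted as 'RISK: <number>'.\n\n"
        f"Discharge note:\n{discharge_note}\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
    # Keep only the newly generated tokens (the chain of thought plus the score line)
    text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    match = re.search(r"RISK:\s*(\d+(?:\.\d+)?)", text)
    return text, float(match.group(1)) if match else None
```

A setup of this kind keeps the notes on local hardware, which is the security and scalability point raised in the clinical implications.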
FINDINGS
The cohort included 1995 individuals who died by suicide or accidental death and 9975 matched individuals (5:1 matching), totalling 11 954 discharges and 58 933 person-years of follow-up. In Fine and Gray regression, hazard as estimated by the Llama3-distilled model was significantly associated with observed risk (unadjusted HR 4.65 (3.58-6.04)). The corresponding c-statistic was 0.64 (0.63-0.65), modestly poorer than that of the GPT-4o model (0.67 (0.66-0.68)). In the chain-of-thought reasoning, topics including Substance Abuse, Surgical Procedure and Age-related Comorbidities were associated with correct predictions, while Fall-related Injury was associated with incorrect predictions.
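For readers unfamiliar with the discrimination metric, the minimal sketch below shows how a c-statistic could be computed from model-derived risk scores and observed follow-up times. The file and column names are hypothetical, and the Fine and Gray subdistribution hazard model itself (commonly fit with R's cmprsk package) is not reproduced here.

```python
# Sketch only: concordance between LLM risk scores and observed outcomes.
# Variable names and the CSV file are hypothetical placeholders.
import pandas as pd
from lifelines.utils import concordance_index

df = pd.read_csv("cohort_with_llm_scores.csv")  # hypothetical file

# follow_up_years: time from discharge to death or censoring
# died: 1 if suicide/accidental death observed, 0 if censored
# llm_risk_score: risk estimate parsed from the model's output
c_stat = concordance_index(
    df["follow_up_years"],
    -df["llm_risk_score"],  # negate: higher risk should correspond to shorter survival
    df["died"],
)
print(f"c-statistic: {c_stat:.2f}")
```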
CONCLUSIONS
Application of a reasoning model using local, consumer-grade hardware only modestly diminished performance in stratifying suicide risk.
CLINICAL IMPLICATIONS
Smaller models can yield more secure, scalable and transparent risk prediction.