{"title":"Reasoning language models for more transparent prediction of suicide risk.","authors":"Thomas H McCoy,Roy H Perlis","doi":"10.1136/bmjment-2025-301654","DOIUrl":null,"url":null,"abstract":"BACKGROUND\r\nWe previously demonstrated that a large language model could estimate suicide risk using hospital discharge notes.\r\n\r\nOBJECTIVE\r\nWith the emergence of reasoning models that can be run on consumer-grade hardware, we investigated whether these models can approximate the performance of much larger and costlier models.\r\n\r\nMETHODS\r\nFrom 458 053 adults hospitalised at one of two academic medical centres between 4 January 2005 and 2 January 2014, we identified 1995 who died by suicide or accident, and matched them with 5 control individuals. We used Llama-DeepSeek-R1 8B to generate predictions of risk. Beyond discrimination and calibration, we examined the aspects of model reasoning-that is, the topics in the chain of thought-associated with correct or incorrect predictions.\r\n\r\nFINDINGS\r\nThe cohort included 1995 individuals who died by suicide or accidental death and 9975 individuals matched 5:1, totalling 11 954 discharges and 58 933 person-years of follow-up. In Fine and Grey regression, hazard as estimated by the Llama3-distilled model was significantly associated with observed risk (unadjusted HR 4.65 (3.58-6.04)). The corresponding c-statistic was 0.64 (0.63-0.65), modestly poorer than the GPT4o model (0.67 (0.66-0.68)). In chain-of-thought reasoning, topics including Substance Abuse, Surgical Procedure, and Age-related Comorbidities were associated with correct predictions, while Fall-related Injury was associated with incorrect prediction.\r\n\r\nCONCLUSIONS\r\nApplication of a reasoning model using local, consumer-grade hardware only modestly diminished performance in stratifying suicide risk.\r\n\r\nCLINICAL IMPLICATIONS\r\nSmaller models can yield more secure, scalable and transparent risk prediction.","PeriodicalId":72434,"journal":{"name":"BMJ mental health","volume":"28 1","pages":""},"PeriodicalIF":4.9000,"publicationDate":"2025-05-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"BMJ mental health","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1136/bmjment-2025-301654","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"0","JCRName":"PSYCHIATRY","Score":null,"Total":0}
Abstract
BACKGROUND
We previously demonstrated that a large language model could estimate suicide risk using hospital discharge notes.
OBJECTIVE
With the emergence of reasoning models that can be run on consumer-grade hardware, we investigated whether these models can approximate the performance of much larger and costlier models.
METHODS
From 458 053 adults hospitalised at one of two academic medical centres between 4 January 2005 and 2 January 2014, we identified 1995 who died by suicide or accident and matched each with 5 control individuals. We used Llama-DeepSeek-R1 8B to generate predictions of risk. Beyond discrimination and calibration, we examined aspects of model reasoning, that is, the topics in the chain of thought, that were associated with correct or incorrect predictions.
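As a rough illustration only (not the authors' actual pipeline), the sketch below shows how a locally run DeepSeek-R1 distilled Llama 8B checkpoint might be prompted to produce both a chain of thought and a parsed risk score from a discharge note. The checkpoint name, prompt wording and score format are assumptions introduced for this example.

```python
# Minimal sketch, assuming a Hugging Face checkpoint for the distilled model;
# the prompt and the "RISK: <number>" output convention are illustrative, not
# taken from the paper.
import re
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "deepseek-ai/DeepSeek-R1-Distill-Llama-8B"  # assumed checkpoint name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto")

def estimate_risk(discharge_note: str) -> tuple[str, float | None]:
    """Return the model's generated reasoning and a parsed 0-100 risk score."""
    prompt = (
        "You are a clinical risk assessor. Read the discharge note and, "
        "after reasoning step by step, give a suicide/accident risk score "
        "from 0 to 100 on a final line formatted as 'RISK: <number>'.\n\n"
        f"Discharge note:\n{discharge_note}\n"
    )
    inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
    output = model.generate(**inputs, max_new_tokens=1024, do_sample=False)
    # Keep only the newly generated tokens (the chain of thought plus the score line)
    text = tokenizer.decode(output[0][inputs["input_ids"].shape[1]:],
                            skip_special_tokens=True)
    match = re.search(r"RISK:\s*(\d+(?:\.\d+)?)", text)
    return text, float(match.group(1)) if match else None
```

A setup of this kind keeps the notes on local hardware, which is the security and scalability point raised in the clinical implications.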
FINDINGS
The cohort included 1995 individuals who died by suicide or accidental death and 9975 matched individuals (5:1 matching), totalling 11 954 discharges and 58 933 person-years of follow-up. In Fine and Gray regression, hazard as estimated by the Llama3-distilled model was significantly associated with observed risk (unadjusted HR 4.65 (3.58-6.04)). The corresponding c-statistic was 0.64 (0.63-0.65), modestly poorer than that of the GPT-4o model (0.67 (0.66-0.68)). In the chain-of-thought reasoning, topics including Substance Abuse, Surgical Procedure and Age-related Comorbidities were associated with correct predictions, while Fall-related Injury was associated with incorrect predictions.
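For readers unfamiliar with the discrimination metric, the minimal sketch below shows how a c-statistic could be computed from model-derived risk scores and observed follow-up times. The file and column names are hypothetical, and the Fine and Gray subdistribution hazard model itself (commonly fit with R's cmprsk package) is not reproduced here.

```python
# Sketch only: concordance between LLM risk scores and observed outcomes.
# Variable names and the CSV file are hypothetical placeholders.
import pandas as pd
from lifelines.utils import concordance_index

df = pd.read_csv("cohort_with_llm_scores.csv")  # hypothetical file

# follow_up_years: time from discharge to death or censoring
# died: 1 if suicide/accidental death observed, 0 if censored
# llm_risk_score: risk estimate parsed from the model's output
c_stat = concordance_index(
    df["follow_up_years"],
    -df["llm_risk_score"],  # negate: higher risk should correspond to shorter survival
    df["died"],
)
print(f"c-statistic: {c_stat:.2f}")
```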
CONCLUSIONS
Application of a reasoning model using local, consumer-grade hardware only modestly diminished performance in stratifying suicide risk.
CLINICAL IMPLICATIONS
Smaller models can yield more secure, scalable and transparent risk prediction.