Interpretable LLM-Based Detection of Loose Associations Using Synthetic Speech Data in Early Psychosis.

IF 4.8 · Zone 1 (Medicine) · Q1 Psychiatry

Schizophrenia Bulletin · Pub Date: 2025-09-05 · DOI: 10.1093/schbul/sbaf125

Enrique Gutiérrez, Carlos Quesada, Emily DeFraites, Danielle J Harper, Amar D Mandavia


Citations: 0

Abstract

Background and hypothesis: Loose Associations (LA) in speech are key indicators of psychosis risk, notably in schizophrenia. Current detection methods are hampered by subjective evaluation, small samples, and poor generalizability. We hypothesize that combining Large Language Models (LLMs) with machine learning techniques could enhance objective identification of LA through improved semantic and probabilistic linguistic measures.

Study design: We propose a novel and reproducible workflow for generating synthetic conversational instances of LA using LLMs, guided by linguistic theory and validated through clinical expert review. This synthetic dataset forms the basis for model training and is complemented by an independently collected dataset for evaluation. Features extracted included traditional clause similarity measures alongside novel surprisal metrics quantifying semantic coherence and unexpected lexical shifts. A parsimonious and interpretable Light Gradient Boosting Machine model was trained using only four features.
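
The two feature families named above, clause similarity and surprisal, can be illustrated with a minimal sketch. This is not the authors' pipeline: the abstract does not say which language model, tokenizer, or similarity measure was used. Here an add-one smoothed bigram model stands in for the LLM's next-word probabilities, and a bag-of-words cosine stands in for the clause similarity measure. All names (`BigramModel`, `mean_surprisal`, `cosine_similarity`) are hypothetical. Surprisal is computed as the negative base-2 log of the probability of a word given its context.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (an assumption, for illustration only)."""
    return text.lower().split()


def cosine_similarity(clause_a, clause_b):
    """Bag-of-words cosine similarity between two clauses."""
    a, b = Counter(tokenize(clause_a)), Counter(tokenize(clause_b))
    dot = sum(a[w] * b[w] for w in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class BigramModel:
    """Add-one smoothed bigram model: a toy stand-in for an LLM."""

    def __init__(self, corpus):
        self.context_counts = Counter()
        self.bigram_counts = Counter()
        vocab = set()
        for sentence in corpus:
            tokens = ["<s>"] + tokenize(sentence)
            vocab.update(tokens[1:])
            self.context_counts.update(tokens[:-1])
            self.bigram_counts.update(zip(tokens, tokens[1:]))
        # One extra slot reserves probability mass for unseen words.
        self.vocab_size = len(vocab) + 1

    def prob(self, prev, word):
        """Smoothed P(word | prev)."""
        numerator = self.bigram_counts[(prev, word)] + 1
        return numerator / (self.context_counts[prev] + self.vocab_size)

    def surprisal(self, prev, word):
        """Surprisal in bits: -log2 P(word | prev)."""
        return -math.log2(self.prob(prev, word))


def mean_surprisal(model, clause):
    """Average per-word surprisal of a clause under the model."""
    tokens = ["<s>"] + tokenize(clause)
    values = [model.surprisal(p, w) for p, w in zip(tokens, tokens[1:])]
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    corpus = [
        "i went to the store",
        "i went to the park",
        "the store was closed",
        "the park was open",
    ]
    model = BigramModel(corpus)
    print(mean_surprisal(model, "i went to the store"))
    print(mean_surprisal(model, "i went purple the radio"))
```

In this toy setting, a clause that follows the patterns of the reference corpus receives lower mean surprisal than one containing unexpected lexical shifts, which is the intuition behind using surprisal as a coherence signal.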

Study results: The final model achieved high accuracy (83.46%; 95% CI: 82.96-83.95) on the synthetic dataset and robust performance on an independent set (82.36%; 95% CI: 81.94-82.78, AUC: 0.868). Our model outperformed baselines, including similarity-only models and prior thought disorder detection workflows. SHapley Additive exPlanations analysis confirmed the interpretability of the selected features, highlighting semantic coherence and word surprisal as key discriminators.
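
For readers unfamiliar with how figures such as an accuracy with a 95% CI and an AUC are obtained, the following is a minimal sketch. The abstract does not state how the intervals were computed; a percentile bootstrap is assumed here purely for illustration, and the function names and toy labels are hypothetical rather than taken from the paper.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy."""
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample example indices with replacement.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a
    random negative, with ties counted as one half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy labels and scores, invented for illustration.
    y_true = [1, 1, 1, 0, 0, 0, 1, 0]
    scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.6, 0.7, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]
    print(accuracy(y_true, y_pred))
    print(bootstrap_ci(y_true, y_pred))
    print(auc(y_true, scores))
```

Narrow intervals such as those reported above generally reflect a large number of evaluated instances; with only a handful of examples, as in this toy data, the bootstrap interval is correspondingly wide.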

Conclusions: Our approach demonstrates that LLM-derived linguistic features substantially enhance the objective, scalable detection of LA. The resulting model achieves high accuracy with minimal complexity, facilitating clinical applicability and interpretability. Future research should integrate additional lexical and contextual dimensions to further refine the identification of thought disorders, ultimately supporting early psychosis intervention.


Source journal

Schizophrenia Bulletin (Medicine: Psychiatry)

CiteScore

11.40

Self-citation rate

6.10%

Articles published

163

Review time

4-8 weeks

About the journal: Schizophrenia Bulletin seeks to review recent developments and empirically based hypotheses regarding the etiology and treatment of schizophrenia. We view the field as broad and deep, and will publish new knowledge ranging from the molecular basis to social and cultural factors. We will give new emphasis to translational reports which simultaneously highlight basic neurobiological mechanisms and clinical manifestations. Some of the Bulletin content is invited as special features or manuscripts organized as a theme by special guest editors. Most pages of the Bulletin are devoted to unsolicited manuscripts of high quality that report original data or where we can provide a special venue for a major study or workshop report. Supplement issues are sometimes provided for manuscripts reporting from a recent conference.