Interpretable LLM-Based Detection of Loose Associations Using Synthetic Speech Data in Early Psychosis
Enrique Gutiérrez, Carlos Quesada, Emily DeFraites, Danielle J Harper, Amar D Mandavia
Schizophrenia Bulletin, published 2025-09-05 (Journal Article)
DOI: 10.1093/schbul/sbaf125 (https://doi.org/10.1093/schbul/sbaf125)
Abstract
Background and hypothesis: Loose Associations (LA) in speech are key indicators of psychosis risk, notably in schizophrenia. Current detection methods are hampered by subjective evaluation, small samples, and poor generalizability. We hypothesize that combining Large Language Models (LLMs) with machine learning techniques could enhance objective identification of LA through improved semantic and probabilistic linguistic measures.
Study design: We propose a novel and reproducible workflow for generating synthetic conversational instances of LA using LLMs, guided by linguistic theory and validated through clinical expert review. This synthetic dataset forms the basis for model training and is complemented by an independently collected dataset for evaluation. The extracted features combined traditional clause similarity measures with novel surprisal metrics that quantify semantic coherence and unexpected lexical shifts. A parsimonious and interpretable Light Gradient Boosting Machine model was trained using only four features.
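To make the two feature families concrete, here is a minimal sketch of a clause similarity score and a word surprisal score. It is an illustration under stated assumptions, not the authors' pipeline: the abstract does not specify how similarity or surprisal were computed, so this sketch substitutes bag-of-words cosine similarity for clause similarity and a toy add-one-smoothed bigram model for the LLM that would supply word probabilities. All names (`clause_similarity`, `BigramModel`, `mean_surprisal`) are hypothetical. Surprisal is the standard quantity -log2 p(word | context), so unexpected words score higher.

```python
import math
from collections import Counter


def clause_similarity(clause_a, clause_b):
    """Cosine similarity between bag-of-words vectors of two clauses.

    Assumption: a stand-in for the embedding-based similarity a real
    system would likely use.
    """
    a = Counter(clause_a.lower().split())
    b = Counter(clause_b.lower().split())
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class BigramModel:
    """Toy add-one-smoothed bigram model (hypothetical stand-in for an LLM)."""

    def __init__(self, corpus):
        self.contexts = Counter()  # how often each token precedes another
        self.bigrams = Counter()   # counts of (previous, word) pairs
        vocab = set()
        for sentence in corpus:
            tokens = ["<s>"] + sentence.lower().split()
            vocab.update(tokens[1:])
            self.contexts.update(tokens[:-1])
            self.bigrams.update(zip(tokens, tokens[1:]))
        self.vocab_size = len(vocab) + 1  # +1 reserves mass for unseen words

    def prob(self, prev, word):
        """Smoothed estimate of p(word | prev)."""
        return (self.bigrams[(prev, word)] + 1) / (
            self.contexts[prev] + self.vocab_size
        )

    def surprisal(self, prev, word):
        """Surprisal in bits: -log2 p(word | prev)."""
        return -math.log2(self.prob(prev, word))


def mean_surprisal(model, clause):
    """Average per-word surprisal of a clause under the model."""
    tokens = ["<s>"] + clause.lower().split()
    values = [model.surprisal(p, w) for p, w in zip(tokens, tokens[1:])]
    return sum(values) / len(values) if values else 0.0


if __name__ == "__main__":
    model = BigramModel(["the dog chased the cat", "the cat sat on the mat"])
    print(clause_similarity("the cat sat", "the cat slept"))
    print(mean_surprisal(model, "the cat sat"))
    print(mean_surprisal(model, "mat the on"))
```

In a pipeline like the one described, scores of this kind would be computed per clause pair or per utterance and passed as a small feature vector to the classifier. Which four features the final model used is not stated in the abstract.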
Study results: The final model achieved high accuracy (83.46%; 95% CI: 82.96-83.95) on the synthetic dataset and robust performance on an independent set (82.36%; 95% CI: 81.94-82.78, AUC: 0.868). Our model outperformed baselines, including similarity-only models and prior thought disorder detection workflows. SHapley Additive exPlanations analysis confirmed the interpretability of the selected features, highlighting semantic coherence and word surprisal as key discriminators.
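The reported metrics (accuracy with a 95% confidence interval, and AUC) can be illustrated with a short sketch. The abstract does not say how the intervals were obtained, so the percentile bootstrap below is an assumption, and the function names (`accuracy`, `bootstrap_accuracy_ci`, `roc_auc`) are hypothetical. AUC is computed from its rank interpretation: the probability that a randomly chosen positive example receives a higher score than a randomly chosen negative one, with ties counted as one half.

```python
import random


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def bootstrap_accuracy_ci(y_true, y_pred, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for accuracy.

    Assumption: one common way to obtain such an interval; the source
    does not state which method was used.
    """
    rng = random.Random(seed)
    n = len(y_true)
    stats = []
    for _ in range(n_resamples):
        # Resample example indices with replacement and rescore.
        idx = [rng.randrange(n) for _ in range(n)]
        stats.append(sum(y_true[i] == y_pred[i] for i in idx) / n)
    stats.sort()
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


def roc_auc(y_true, scores):
    """AUC as P(score of a positive > score of a negative), ties count 0.5."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Synthetic labels for illustration only; not data from the study.
    y_true = [1] * 50 + [0] * 50
    y_pred = [1] * 40 + [0] * 10 + [0] * 40 + [1] * 10
    print(accuracy(y_true, y_pred))
    print(bootstrap_accuracy_ci(y_true, y_pred))
    print(roc_auc([1, 1, 0, 0], [0.9, 0.4, 0.6, 0.1]))
```

The pairwise AUC loop is quadratic in the number of examples, which is fine for a demonstration; a sort-based rank computation would be the usual choice for larger evaluation sets.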
Conclusions: Our approach demonstrates that LLM-derived linguistic features substantially enhance the objective, scalable detection of LA. The resulting model achieves high accuracy with minimal complexity, facilitating clinical applicability and interpretability. Future research should integrate additional lexical and contextual dimensions to further refine the identification of thought disorders, ultimately supporting early psychosis intervention.
About the journal:
Schizophrenia Bulletin seeks to review recent developments and empirically based hypotheses regarding the etiology and treatment of schizophrenia. We view the field as broad and deep, and will publish new knowledge ranging from the molecular basis to social and cultural factors. We will give new emphasis to translational reports which simultaneously highlight basic neurobiological mechanisms and clinical manifestations. Some of the Bulletin content is invited as special features or manuscripts organized as a theme by special guest editors. Most pages of the Bulletin are devoted to unsolicited manuscripts of high quality that report original data or where we can provide a special venue for a major study or workshop report. Supplement issues are sometimes provided for manuscripts reporting from a recent conference.