Can AI replace clinician-rated depression scales? The psychometric properties of HAMLET – Hamilton Large-Language-Model Evaluation Tool

Wen Xin, Huailiang Yi, Jiaqi Song, Zhanxiao Tian, Haitao Chen, Shuping Tan

Asian Journal of Psychiatry, Volume 113, Article 104707 (October 2025). DOI: 10.1016/j.ajp.2025.104707
Abstract
Background
The Hamilton Depression Rating Scale (HAMD) is the gold standard for assessing depression but requires clinician administration, limiting its accessibility. Large language models (LLMs) offer potential for automated, valid assessments. We developed HAMLET (Hamilton Large-language-model Evaluation Tool), an interactive LLM-based tool designed to replicate the HAMD-17, and compared its results with those obtained by a psychiatrist.
Methods
HAMLET uses Qwen-Max with a temperature of 0.7 and a top-p of 0.6, guided by structured prompts engineered through clinician-supervised tuning. Sixty patients with Major Depressive Disorder completed (1) HAMLET, (2) the clinician-rated HAMD-17, and (3) the PHQ-9. Agreement was assessed using the intraclass correlation coefficient (ICC), Bland-Altman plots, and Gwet's AC2. Correlations with the HAMD and incremental validity beyond the PHQ-9 were evaluated.
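For illustration, the sketch below shows how one interview turn could be issued to Qwen-Max with the sampling parameters reported above (temperature 0.7, top-p 0.6). The OpenAI-compatible DashScope endpoint, the system prompt, and the function name are assumptions made for this example; the paper's clinician-tuned prompts are not reproduced here.

```python
# Illustrative sketch only: endpoint URL, model name string, and prompt text are
# assumptions; only the sampling parameters come from the Methods section.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_DASHSCOPE_API_KEY",  # placeholder credential
    base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",  # assumed endpoint
)

SYSTEM_PROMPT = (
    "You are a structured interviewer administering the HAMD-17. "
    "Ask one item at a time, probe for severity, and record a rating per item."
)  # stand-in for the clinician-supervised prompts described in the Methods

def hamlet_turn(history: list[dict], patient_reply: str) -> str:
    """Send one interview turn and return the model's next question or rating."""
    messages = [
        {"role": "system", "content": SYSTEM_PROMPT},
        *history,
        {"role": "user", "content": patient_reply},
    ]
    response = client.chat.completions.create(
        model="qwen-max",
        messages=messages,
        temperature=0.7,  # as reported in the Methods
        top_p=0.6,        # as reported in the Methods
    )
    return response.choices[0].message.content
```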
Results
HAMLET demonstrated strong overall agreement with clinician-rated HAMD scores (ICC = 0.911; 95% CI: 0.855–0.946). It correlated more strongly with HAMD scores than the PHQ-9 did (r = 0.92 vs. 0.79; Steiger's Z = 3.798, p < 0.001) and showed incremental validity beyond the PHQ-9 (ΔR² = 0.252). Item-level agreement (Gwet's AC2) exceeded 0.60 for all items, though it was lower for sensitive questions.
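As a minimal sketch of the two comparisons reported here, the code below implements Steiger's (1980) Z test for dependent correlations that share one variable (clinician HAMD correlated with HAMLET vs. with the PHQ-9) and the hierarchical-regression ΔR² used for incremental validity. The data are simulated, not the study's patients, and the paper does not specify its analysis software, so pingouin/statsmodels-style tooling and all variable names are assumptions.

```python
# Hedged sketch: Steiger's Z for dependent correlations and delta-R^2 from a
# hierarchical regression, on simulated scores (not the study data).
import numpy as np
import statsmodels.api as sm

def steiger_z(r_xy: float, r_xz: float, r_yz: float, n: int) -> float:
    """Compare r(x, y) with r(x, z) when y and z come from the same sample.

    x is the shared variable (clinician HAMD); y and z are the two competing
    measures (HAMLET and PHQ-9). Uses the Dunn & Clark approach popularized
    by Steiger (1980).
    """
    rm2 = ((r_xy + r_xz) / 2.0) ** 2
    f = min((1.0 - r_yz) / (2.0 * (1.0 - rm2)), 1.0)
    h = (1.0 - f * rm2) / (1.0 - rm2)
    z_diff = np.arctanh(r_xy) - np.arctanh(r_xz)
    return z_diff * np.sqrt((n - 3) / (2.0 * (1.0 - r_yz) * h))

def delta_r2(hamd: np.ndarray, phq9: np.ndarray, hamlet: np.ndarray) -> float:
    """Incremental validity: R^2 gained by adding HAMLET over the PHQ-9 alone."""
    base = sm.OLS(hamd, sm.add_constant(phq9)).fit()
    full = sm.OLS(hamd, sm.add_constant(np.column_stack([phq9, hamlet]))).fit()
    return full.rsquared - base.rsquared

# Toy usage with simulated scores (n = 60, matching the study's sample size)
rng = np.random.default_rng(0)
hamd = rng.normal(20, 6, 60)
hamlet = hamd + rng.normal(0, 2.5, 60)  # tracks HAMD closely
phq9 = hamd + rng.normal(0, 5.0, 60)    # tracks HAMD more loosely
z = steiger_z(np.corrcoef(hamd, hamlet)[0, 1],
              np.corrcoef(hamd, phq9)[0, 1],
              np.corrcoef(hamlet, phq9)[0, 1], n=60)
print(f"Steiger's Z = {z:.3f}, delta R^2 = {delta_r2(hamd, phq9, hamlet):.3f}")
```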
Conclusion
HAMLET is the first LLM-based framework to autonomously conduct HAMD-17 assessments, demonstrating substantial agreement with clinician ratings. It combines the rigor of clinician-rated scales with the accessibility of self-report tools and demonstrates the feasibility of LLMs for scalable, low-cost psychiatric assessment. Future work should address contextual limitations and explore multimodal integration.
About the journal:
The Asian Journal of Psychiatry serves as a comprehensive resource for psychiatrists, mental health clinicians, neurologists, physicians, mental health students, and policymakers. Its goal is to facilitate the exchange of research findings and clinical practices between Asia and the global community. The journal focuses on psychiatric research relevant to Asia, covering preclinical, clinical, service system, and policy development topics. It also highlights the socio-cultural diversity of the region in relation to mental health.