Psychometric Evaluation of Large Language Model Embeddings for Personality Trait Prediction.

Impact Factor: 5.8 | CAS Tier 2 (Medicine) | JCR Q1: Health Care Sciences & Services
Julina Maharjan, Ruoming Jin, Jianfeng Zhu, Deric Kenne
{"title":"Psychometric Evaluation of Large Language Model Embeddings for Personality Trait Prediction.","authors":"Julina Maharjan, Ruoming Jin, Jianfeng Zhu, Deric Kenne","doi":"10.2196/75347","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Recent advancements in large language models (LLMs) have generated significant interest in their potential for assessing psychological constructs, particularly personality traits. While prior research has explored LLMs' capabilities in zero-shot or few-shot personality inference, few studies have systematically evaluated LLM embeddings within a psychometric validity framework or examined their correlations with linguistic and emotional markers. Additionally, the comparative efficacy of LLM embeddings against traditional feature engineering methods remains underexplored, leaving gaps in understanding their scalability and interpretability for computational personality assessment.</p><p><strong>Objective: </strong>This study evaluates LLM embeddings for personality trait prediction through four key analyses: (1) performance comparison with zero-shot methods on PANDORA Reddit data, (2) psychometric validation and correlation with LIWC (Linguistic Inquiry and Word Count) and emotion features, (3) benchmarking against traditional feature engineering approaches, and (4) assessment of model size effects (OpenAI vs BERT vs RoBERTa). We aim to establish LLM embeddings as a psychometrically valid and efficient alternative for personality assessment.</p><p><strong>Methods: </strong>We conducted a multistage analysis using 1 million Reddit posts from the PANDORA Big Five personality dataset. First, we generated text embeddings using 3 LLM architectures (RoBERTa, BERT, and OpenAI) and trained a custom bidirectional long short-term memory model for personality prediction. We compared this approach against zero-shot inference using prompt-based methods. Second, we extracted psycholinguistic features (LIWC categories and National Research Council emotions) and performed feature engineering to evaluate potential performance enhancements. Third, we assessed the psychometric validity of LLM embeddings: reliability validity using Cronbach α and convergent validity analysis by examining correlations between embeddings and established linguistic markers. Finally, we performed traditional feature engineering on static psycholinguistic features to assess performance under different settings.</p><p><strong>Results: </strong>LLM embeddings trained using simple deep learning techniques significantly outperform zero-shot approaches on average by 45% across all personality traits. Although psychometric validation tests indicate moderate reliability, with an average Cronbach α of 0.63, correlation analyses spark a strong association with key linguistic or emotional markers; openness correlates highly with social (r=0.53), conscientiousness with linguistic (r=0.46), extraversion with social (r=0.41), agreeableness with pronoun usage (r=0.40), and neuroticism with politics-related text (r=0.63). Despite adding advanced feature engineering on linguistic features, the performance did not improve, suggesting that LLM embeddings inherently capture key linguistic features. 
Furthermore, our analyses demonstrated efficacy on larger model size with a computational cost trade-off.</p><p><strong>Conclusions: </strong>Our findings demonstrate that LLM embeddings offer a robust alternative to zero-shot methods in personality trait analysis, capturing key linguistic patterns without requiring extensive feature engineering. The correlation between established psycholinguistic markers and the performance trade-off with computational cost provides a hint for future computational linguistic work targeting LLM for personality assessment. Further research should explore fine-tuning strategies to enhance psychometric validity.</p>","PeriodicalId":16337,"journal":{"name":"Journal of Medical Internet Research","volume":"27 ","pages":"e75347"},"PeriodicalIF":5.8000,"publicationDate":"2025-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12262148/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Medical Internet Research","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.2196/75347","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"HEALTH CARE SCIENCES & SERVICES","Score":null,"Total":0}
引用次数: 0

Abstract

Background: Recent advancements in large language models (LLMs) have generated significant interest in their potential for assessing psychological constructs, particularly personality traits. While prior research has explored LLMs' capabilities in zero-shot or few-shot personality inference, few studies have systematically evaluated LLM embeddings within a psychometric validity framework or examined their correlations with linguistic and emotional markers. Additionally, the comparative efficacy of LLM embeddings against traditional feature engineering methods remains underexplored, leaving gaps in understanding their scalability and interpretability for computational personality assessment.

Objective: This study evaluates LLM embeddings for personality trait prediction through four key analyses: (1) performance comparison with zero-shot methods on PANDORA Reddit data, (2) psychometric validation and correlation with LIWC (Linguistic Inquiry and Word Count) and emotion features, (3) benchmarking against traditional feature engineering approaches, and (4) assessment of model size effects (OpenAI vs BERT vs RoBERTa). We aim to establish LLM embeddings as a psychometrically valid and efficient alternative for personality assessment.
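
For the zero-shot baseline in analysis (1), a prompt-based setup along the following lines is plausible. This is a minimal sketch, not the paper's method: the abstract does not specify the prompts or the model used, so the model name and prompt wording here are assumptions.

```python
# Hedged sketch of zero-shot Big Five inference via prompting.
# The model name and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # expects OPENAI_API_KEY in the environment

post = "I spend most weekends reading about new ideas."
prompt = (
    "Rate the author of the following Reddit post on each Big Five trait "
    "(openness, conscientiousness, extraversion, agreeableness, "
    "neuroticism) on a 0-1 scale. Answer as JSON.\n\nPost: " + post
)
response = client.chat.completions.create(
    model="gpt-4o-mini",  # assumption; the paper's model is unspecified
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```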

Methods: We conducted a multistage analysis using 1 million Reddit posts from the PANDORA Big Five personality dataset. First, we generated text embeddings using 3 LLM architectures (RoBERTa, BERT, and OpenAI embedding models) and trained a custom bidirectional long short-term memory (BiLSTM) model for personality prediction. We compared this approach against zero-shot inference using prompt-based methods. Second, we extracted psycholinguistic features (LIWC categories and National Research Council emotion lexicon features) and performed feature engineering to evaluate potential performance enhancements. Third, we assessed the psychometric validity of LLM embeddings: reliability using Cronbach α, and convergent validity by examining correlations between embeddings and established linguistic markers. Finally, we performed traditional feature engineering on static psycholinguistic features to assess performance under different settings.
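
To make the embedding pipeline concrete, here is a minimal sketch of the frozen-encoder-plus-BiLSTM design described above, using RoBERTa via Hugging Face Transformers. The layer sizes, pooling strategy, and choice of `roberta-base` are illustrative assumptions, not the paper's exact configuration.

```python
# Minimal sketch: frozen transformer embeddings feeding a BiLSTM regressor
# for the Big Five traits. Dimensions and hyperparameters are assumptions,
# not the paper's reported configuration.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

class BiLSTMPersonalityRegressor(nn.Module):
    def __init__(self, embed_dim=768, hidden_dim=128, n_traits=5):
        super().__init__()
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden_dim, n_traits)  # one score per trait

    def forward(self, token_embeddings):
        # token_embeddings: (batch, seq_len, embed_dim)
        _, (h, _) = self.lstm(token_embeddings)
        # Concatenate the final forward and backward hidden states.
        h = torch.cat([h[-2], h[-1]], dim=-1)
        return self.head(h)  # (batch, n_traits)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
encoder = AutoModel.from_pretrained("roberta-base").eval()  # frozen encoder

posts = ["I spend most weekends reading about new ideas."]
batch = tokenizer(posts, return_tensors="pt", truncation=True, padding=True)
with torch.no_grad():
    embeddings = encoder(**batch).last_hidden_state  # (1, seq_len, 768)

model = BiLSTMPersonalityRegressor()
trait_scores = model(embeddings)
print(trait_scores.shape)  # torch.Size([1, 5])
```

In practice the regression head would be trained against PANDORA trait scores with a standard regression loss (eg, mean squared error); that training loop is omitted here.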

Results: LLM embeddings trained with simple deep learning techniques significantly outperform zero-shot approaches, by 45% on average across all personality traits. Although psychometric validation tests indicate moderate reliability, with an average Cronbach α of 0.63, correlation analyses reveal strong associations with key linguistic and emotional markers: openness correlates most strongly with social words (r=0.53), conscientiousness with linguistic features (r=0.46), extraversion with social words (r=0.41), agreeableness with pronoun usage (r=0.40), and neuroticism with politics-related language (r=0.63). Adding advanced feature engineering on linguistic features did not improve performance, suggesting that LLM embeddings inherently capture key linguistic features. Furthermore, our analyses showed that larger models performed better, at the cost of greater computation.
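
As an illustration of the two psychometric checks reported above, the sketch below computes Cronbach α for internal consistency and a Pearson correlation for convergent validity, on toy data. How the paper defines the "items" entering Cronbach α over embedding-based predictions is not specified in the abstract, so the item construction here is an assumption.

```python
# Sketch of the two psychometric checks: Cronbach alpha for internal
# consistency and Pearson r for convergent validity. Toy data; the item
# definition over embedding-based scores is an assumption.
import numpy as np
from scipy.stats import pearsonr

def cronbach_alpha(items: np.ndarray) -> float:
    """items: (n_respondents, n_items) matrix of scores."""
    n_items = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return (n_items / (n_items - 1)) * (1 - item_vars / total_var)

rng = np.random.default_rng(0)
latent = rng.normal(size=200)
# Correlated "items" derived from one latent trait plus noise.
items = latent[:, None] + rng.normal(scale=0.8, size=(200, 6))
print(f"Cronbach alpha: {cronbach_alpha(items):.2f}")

# Convergent validity: predicted openness vs a social-word rate feature.
predicted_openness = latent + rng.normal(scale=0.5, size=200)
social_word_rate = latent + rng.normal(scale=0.9, size=200)
r, p = pearsonr(predicted_openness, social_word_rate)
print(f"r = {r:.2f}, p = {p:.3g}")
```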

Conclusions: Our findings demonstrate that LLM embeddings offer a robust alternative to zero-shot methods for personality trait analysis, capturing key linguistic patterns without requiring extensive feature engineering. The observed correlations with established psycholinguistic markers, together with the trade-off between performance and computational cost, point the way for future computational linguistics work applying LLMs to personality assessment. Further research should explore fine-tuning strategies to enhance psychometric validity.

Source journal: Journal of Medical Internet Research
CiteScore: 14.40
Self-citation rate: 5.40%
Articles per year: 654
Review time: 1 month
Journal description: The Journal of Medical Internet Research (JMIR) is a highly respected publication in the field of health informatics and health services. Founded in 1999, JMIR has been a pioneer in the field for over two decades. The journal focuses on digital health, data science, health informatics, and emerging technologies for health, medicine, and biomedical research. It is recognized as a top publication in these disciplines, ranking in the first quartile (Q1) by Impact Factor. Notably, JMIR is ranked #1 on Google Scholar within the "Medical Informatics" discipline.