An adaptive clustering framework for personality prediction using enhanced seed optimization

Decision Analytics Journal, Pub Date: 2025-09-01, DOI: 10.1016/j.dajour.2025.100630

Hartono, Muhammad Khahfi Zuhanda, Rahmad Syah, Fikriyah Iftinan Fauzi, Nini Sri Wahyuni, Istiana, Eva Yulina

Decision Analytics Journal, Volume 16, Article 100630. https://www.sciencedirect.com/science/article/pii/S2772662225000864

Citations: 0

Abstract

Personality prediction has become an increasingly important area in psychological computing and human-centered AI, especially with the rise of user-generated textual data from social media platforms. However, current approaches, which are primarily based on supervised learning, face major challenges in dealing with class imbalance, noisy inputs, and poor generalization in real-world scenarios. This study introduces an adaptive hybrid clustering framework for MBTI-based personality prediction that integrates K-Means with Nearest Neighbor Density Peak (K-NNDP) and a Determinantal Point Process (DPP) to enhance seed optimization. The framework addresses key limitations of traditional clustering methods (poor handling of class imbalance, lack of diversity, and outlier sensitivity) by combining density-based refinement with probabilistic, diversity-driven seed selection. Applied to the MBTI Kaggle dataset of 8,675 instances, the model transforms unstructured text into numerical vectors using TF-IDF, Bag-of-Words, and GloVe embeddings. Experimental results show that the proposed method outperforms six established supervised models (Decision Trees, KNN, Logistic Regression, LSVC, SGD, and XGBoost) across all multi-label classification metrics, achieving the highest Exact Match Ratio (0.813), Accuracy (0.915), Precision (0.878), Recall (0.897), and F1-Score (0.887), while significantly reducing Hamming Loss (0.103) and Zero-One Loss (0.187). Sensitivity analyses under varying imbalance ratios (up to 1:20), increasing textual noise, and different data diversity levels further validate the model's robustness and generalizability, even in challenging conditions. These findings confirm the effectiveness of the proposed unsupervised approach in uncovering coherent personality clusters without requiring labeled data. Nonetheless, cluster interpretability and runtime performance still need improvement. Future research will explore real-time implementation and integration into personality-aware systems.
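The abstract does not spell out how density information and diversity-driven selection combine when choosing seeds, so the following is only a minimal sketch of the general idea, not the authors' K-NNDP/DPP procedure. It assumes a quality-diversity kernel L_ij = q_i q_j S_ij, where the quality q_i is an inverse mean k-nearest-neighbor distance (a simple stand-in for a density-peak score) and S_ij is a Gaussian similarity. Instead of sampling from a DPP, it uses a deterministic greedy MAP approximation. The names `knn_density`, `greedy_diverse_seeds`, `sigma`, and `k` are hypothetical and chosen for illustration.

```python
import math


def knn_density(points, k=2):
    """Assumed quality score: inverse mean distance to the k nearest neighbours."""
    scores = []
    for i, p in enumerate(points):
        dists = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        scores.append(1.0 / (sum(dists[:k]) / k + 1e-12))
    return scores


def greedy_diverse_seeds(points, n_seeds, sigma=1.0, k=2):
    """Greedy MAP approximation for a DPP with kernel L_ij = q_i * q_j * S_ij.

    Returns indices of points that are dense (high quality) and mutually
    dissimilar (high diversity). This is an illustrative sketch only.
    """
    n = len(points)
    quality = knn_density(points, k)

    def kernel(i, j):
        sq_dist = math.dist(points[i], points[j]) ** 2
        return quality[i] * quality[j] * math.exp(-sq_dist / (2 * sigma ** 2))

    # gain[i] is the factor by which det(L_Y) grows if i is added to Y.
    gain = [kernel(i, i) for i in range(n)]
    # chol[i] holds the incremental Cholesky row of i against the selected set.
    chol = [[] for _ in range(n)]
    selected = []
    remaining = list(range(n))

    while remaining and len(selected) < n_seeds:
        j = max(remaining, key=lambda i: gain[i])
        if gain[j] <= 1e-12:  # no candidate adds meaningful volume
            break
        selected.append(j)
        remaining.remove(j)
        root = math.sqrt(gain[j])
        for i in remaining:
            proj = sum(a * b for a, b in zip(chol[j], chol[i]))
            e = (kernel(j, i) - proj) / root
            chol[i].append(e)
            gain[i] -= e * e  # similar points lose most of their gain
    return selected


# Toy data: three tight groups plus one isolated outlier (index 9).
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1),
    (10.0, 0.0), (10.1, 0.0), (10.0, 0.1),
    (5.0, -20.0),
]
seeds = greedy_diverse_seeds(points, n_seeds=3)
print(sorted(seeds))
```

On this toy data the selected indices fall in three different groups and skip the isolated point, because its low density score gives it almost no gain. In a fuller pipeline the selected points could serve as initial centroids for a K-Means run over TF-IDF or embedding vectors; that wiring is left out here.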


Source journal

Decision Analytics Journal

CiteScore

3.90

Self-citation rate

0.00%

Publication count