Hartono, Muhammad Khahfi Zuhanda, Rahmad Syah, Fikriyah Iftinan Fauzi, Nini Sri Wahyuni, Istiana, Eva Yulina
An adaptive clustering framework for personality prediction using enhanced seed optimization
Decision Analytics Journal, volume 16, Article 100630. Published 2025-09-01.
DOI: 10.1016/j.dajour.2025.100630
https://www.sciencedirect.com/science/article/pii/S2772662225000864
Citations: 0
Abstract
Personality prediction has become an increasingly important area in psychological computing and human-centered AI, especially with the rise of user-generated textual data from social media platforms. However, current approaches – primarily based on supervised learning – face major challenges in dealing with class imbalance, noisy inputs, and poor generalization in real-world scenarios. This study introduces an adaptive hybrid clustering framework for MBTI-based personality prediction by integrating K-Means with Nearest Neighbor Density Peak (K-NNDP) and Determinantal Point Process (DPP) to enhance seed optimization. The framework addresses key limitations of traditional clustering methods – such as poor class imbalance handling, lack of diversity, and outlier sensitivity – by combining density-based refinement with probabilistic, diversity-driven seed selection. Applied to the MBTI Kaggle dataset of 8,675 instances, the model transforms unstructured text into numerical vectors using TF-IDF, Bag-of-Words, and GloVe embeddings. Experimental results show that the proposed method outperforms six established supervised models – Decision Trees, KNN, Logistic Regression, LSVC, SGD, and XGBoost – across all multi-label classification metrics, achieving the highest Exact Match Ratio (0.813), Accuracy (0.915), Precision (0.878), Recall (0.897), and F1-Score (0.887), while significantly reducing Hamming Loss (0.103) and Zero-One Loss (0.187). Sensitivity analyses under varying imbalance ratios (up to 1:20), increasing textual noise, and data diversity levels further validate the model’s robustness and generalizability, even in challenging conditions. These findings confirm the effectiveness of the proposed unsupervised approach in uncovering coherent personality clusters without requiring labeled data. Nonetheless, further improvements are needed in enhancing cluster interpretability and optimizing runtime performance. 
Future research will explore real-time implementation and integration into personality-aware systems.
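
The multi-label metrics named in the abstract can be illustrated with a minimal sketch. It assumes the standard subset-based definitions of Exact Match Ratio, Zero-One Loss, and Hamming Loss; the abstract does not state the exact formulas the authors used. The binary encoding of the four MBTI axes and the toy labels below are hypothetical and are not drawn from the paper's data.

```python
from typing import Sequence

Labels = Sequence[Sequence[int]]


def exact_match_ratio(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of samples whose full label vector is predicted exactly."""
    matches = sum(tuple(t) == tuple(p) for t, p in zip(y_true, y_pred))
    return matches / len(y_true)


def zero_one_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of samples with at least one wrong label (1 - exact match)."""
    return 1.0 - exact_match_ratio(y_true, y_pred)


def hamming_loss(y_true: Labels, y_pred: Labels) -> float:
    """Fraction of individual label slots that are predicted wrongly."""
    total = sum(len(t) for t in y_true)
    wrong = sum(a != b for t, p in zip(y_true, y_pred) for a, b in zip(t, p))
    return wrong / total


# Hypothetical encoding: one binary label per MBTI axis (I/E, N/S, T/F, J/P).
y_true = [[1, 0, 1, 1], [0, 0, 1, 0], [1, 1, 0, 0], [0, 1, 1, 1]]
y_pred = [[1, 0, 1, 1], [0, 1, 1, 0], [1, 1, 0, 0], [0, 1, 0, 0]]

print("Exact Match Ratio:", exact_match_ratio(y_true, y_pred))
print("Zero-One Loss:", zero_one_loss(y_true, y_pred))
print("Hamming Loss:", hamming_loss(y_true, y_pred))
```

Under these definitions, Zero-One Loss is the complement of Exact Match Ratio, which is consistent with the reported values (0.813 + 0.187 = 1.000). Hamming Loss is more lenient because it counts errors per label rather than per sample.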