Navigating pathways to automated personality prediction: a comparative study of small and medium language models.

IF 2.4 Q3 COMPUTER SCIENCE, INFORMATION SYSTEMS

Frontiers in Big Data Pub Date : 2024-09-13 eCollection Date: 2024-01-01 DOI:10.3389/fdata.2024.1387325

Fatima Habib, Zeeshan Ali, Akbar Azam, Komal Kamran, Fahad Mansoor Pasha

{"title":"Navigating pathways to automated personality prediction: a comparative study of small and medium language models.","authors":"Fatima Habib, Zeeshan Ali, Akbar Azam, Komal Kamran, Fahad Mansoor Pasha","doi":"10.3389/fdata.2024.1387325","DOIUrl":null,"url":null,"abstract":"Introduction: Recent advancements in Natural Language Processing (NLP) and widely available social media data have made it possible to predict human personalities in various computational applications. In this context, pre-trained Large Language Models (LLMs) have gained recognition for their exceptional performance in NLP benchmarks. However, these models require substantial computational resources, escalating their carbon and water footprint. Consequently, a shift toward more computationally efficient smaller models is observed.Methods: This study compares a small model ALBERT (11.8M parameters) with a larger model, RoBERTa (125M parameters) in predicting big five personality traits. It utilizes the PANDORA dataset comprising Reddit comments, processing them on a Tesla P100-PCIE-16GB GPU. The study customized both models to support multi-output regression and added two linear layers for fine-grained regression analysis.Results: Results are evaluated on Mean Squared Error (MSE) and Root Mean Squared Error (RMSE), considering the computational resources consumed during training. While ALBERT consumed lower levels of system memory with lower heat emission, it took higher computation time compared to RoBERTa. The study produced comparable levels of MSE, RMSE, and training loss reduction.Discussion: This highlights the influence of training data quality on the model's performance, outweighing the significance of model size. Theoretical and practical implications are also discussed.","PeriodicalId":52859,"journal":{"name":"Frontiers in Big Data","volume":"7 ","pages":"1387325"},"PeriodicalIF":2.4000,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11427259/pdf/","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Frontiers in Big Data","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3389/fdata.2024.1387325","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"2024/1/1 0:00:00","PubModel":"eCollection","JCR":"Q3","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}

引用次数: 0

Abstract

Introduction: Recent advancements in Natural Language Processing (NLP) and widely available social media data have made it possible to predict human personalities in various computational applications. In this context, pre-trained Large Language Models (LLMs) have gained recognition for their exceptional performance in NLP benchmarks. However, these models require substantial computational resources, escalating their carbon and water footprint. Consequently, a shift toward more computationally efficient smaller models is observed.

Methods: This study compares a small model ALBERT (11.8M parameters) with a larger model, RoBERTa (125M parameters) in predicting big five personality traits. It utilizes the PANDORA dataset comprising Reddit comments, processing them on a Tesla P100-PCIE-16GB GPU. The study customized both models to support multi-output regression and added two linear layers for fine-grained regression analysis.

Results: Results are evaluated on Mean Squared Error (MSE) and Root Mean Squared Error (RMSE), considering the computational resources consumed during training. While ALBERT consumed lower levels of system memory with lower heat emission, it took higher computation time compared to RoBERTa. The study produced comparable levels of MSE, RMSE, and training loss reduction.

Discussion: This highlights the influence of training data quality on the model's performance, outweighing the significance of model size. Theoretical and practical implications are also discussed.

查看原文本刊更多论文

通往自动人格预测之路：中小型语言模型的比较研究。

导言：自然语言处理（NLP）领域的最新进展和广泛可用的社交媒体数据使得在各种计算应用中预测人类性格成为可能。在这种情况下，预训练的大型语言模型（LLM）因其在 NLP 基准测试中的优异表现而获得了认可。然而，这些模型需要大量的计算资源，从而增加了碳足迹和水足迹。因此，人们开始转向计算效率更高的小型模型：本研究比较了小型模型 ALBERT（1180 万个参数）和大型模型 RoBERTa（1.25 亿个参数）在预测五大人格特质方面的效果。研究利用了由 Reddit 评论组成的 PANDORA 数据集，并在 Tesla P100-PCIE-16GB GPU 上进行了处理。研究对这两个模型进行了定制，以支持多输出回归，并添加了两个线性层进行细粒度回归分析：结果：根据平均平方误差（MSE）和均方根误差（RMSE）对结果进行了评估，同时考虑了训练过程中消耗的计算资源。与 RoBERTa 相比，ALBERT 消耗的系统内存更少，发热量更低，但计算时间更长。这项研究产生了相当水平的 MSE、RMSE 和训练损失降低率：这凸显了训练数据质量对模型性能的影响，其重要性超过了模型大小。此外，还讨论了理论和实践意义。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊