Supervised learning and resampling techniques on DISC personality classification using Twitter information in Bahasa Indonesia

IF 4.9 Q1 COMPUTER SCIENCE, INFORMATION SYSTEMS

Applied Computing and Informatics | Pub Date: 2021-09-07 | DOI: 10.1108/aci-03-2021-0054

Ema Utami, Irwan Oyong, Suwanto Raharjo, Anggit Dwi Hartanto, Sumarni Adi


Citations: 4

Abstract



Supervised learning and resampling techniques on DISC personality classification using Twitter information in Bahasa Indonesia

Purpose: Gathering knowledge about personality traits has long interested academics and researchers in psychology and computer science. Analyzing profile data from personal social media accounts reduces data collection time, because this method does not require users to fill in any questionnaires. A pure natural language processing (NLP) approach can give decent results, and previous studies show that its reliability can be improved by combining it with machine learning.

Design/methodology/approach: In this study, cleaning the dataset and extracting relevant potential features “as assessed by psychological experts” are essential, as Indonesians tend to mix formal words, non-formal words, slang and abbreviations when writing social media posts. Raw data were derived from a predefined dominance, influence, stability and conscientiousness (DISC) quiz website, returning 316,967 tweets from 1,244 Twitter accounts “filtered to include only personal and Indonesian-language accounts”. Using a combination of NLP techniques and machine learning, the authors aim to develop a better approach and a more robust model, especially for the Indonesian language.

Findings: The authors find that employing a SMOTETomek re-sampling technique and hyperparameter tuning boosts the model’s performance on formalized datasets by 57% (as measured through the F1-score).

Originality/value: Cleaning the dataset and extracting from it the relevant potential features assessed by psychological experts is essential, because Indonesian people tend to mix formal words, non-formal words, slang words and abbreviations when writing tweets. Organic data derived from a predefined DISC quiz website yielded 1,244 Twitter account records and 316,967 tweets.
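The SMOTETomek re-sampling named in the findings combines two steps: SMOTE oversampling of minority classes, followed by removal of Tomek links (cross-class pairs that are each other's nearest neighbour). The sketch below illustrates that idea in plain Python and is not the authors' pipeline. The toy 2-D features, the class sizes, and the helper names `toy_data`, `nearest`, `smote` and `remove_tomek_links` are all assumptions made for illustration; the abstract does not describe the actual feature vectors or class distribution.

```python
import math
import random
from collections import Counter


def toy_data(seed=42):
    """Hypothetical imbalanced data: four DISC labels with made-up 2-D features."""
    rng = random.Random(seed)
    sizes = {"D": 30, "I": 12, "S": 6, "C": 4}          # assumed class sizes
    centers = {"D": (0, 0), "I": (3, 0), "S": (0, 3), "C": (3, 3)}
    X, y = [], []
    for label, n in sizes.items():
        cx, cy = centers[label]
        for _ in range(n):
            X.append([rng.gauss(cx, 1.0), rng.gauss(cy, 1.0)])
            y.append(label)
    return X, y


def nearest(i, X, candidates, k=1):
    """Return up to k indices from `candidates` closest to X[i], excluding i."""
    others = [j for j in candidates if j != i]
    others.sort(key=lambda j: math.dist(X[i], X[j]))
    return others[:k]


def smote(X, y, k=3, seed=0):
    """Grow each minority class to the majority size by interpolating
    between an original sample and one of its k same-class neighbours."""
    rng = random.Random(seed)
    X_out, y_out = [list(p) for p in X], list(y)
    counts = Counter(y)
    target = max(counts.values())
    for label, n in counts.items():
        members = [idx for idx, lab in enumerate(y) if lab == label]
        if len(members) < 2:
            continue  # nothing to interpolate with
        for _ in range(target - n):
            a = rng.choice(members)
            b = rng.choice(nearest(a, X, members, k))
            gap = rng.random()
            X_out.append([u + gap * (v - u) for u, v in zip(X[a], X[b])])
            y_out.append(label)
    return X_out, y_out


def remove_tomek_links(X, y):
    """Drop both members of every Tomek link: a pair of samples with
    different labels that are mutual nearest neighbours."""
    all_idx = range(len(X))
    nn = [nearest(i, X, all_idx)[0] for i in all_idx]
    drop = {i for i in all_idx if nn[nn[i]] == i and y[nn[i]] != y[i]}
    keep = [i for i in all_idx if i not in drop]
    return [X[i] for i in keep], [y[i] for i in keep]


X, y = toy_data()
X_sm, y_sm = smote(X, y)                        # step 1: oversample
X_res, y_res = remove_tomek_links(X_sm, y_sm)   # step 2: clean class borders

print("original:    ", dict(Counter(y)))
print("after SMOTE: ", dict(Counter(y_sm)))
print("after Tomek: ", dict(Counter(y_res)))
```

In practice a maintained implementation, such as the `SMOTETomek` class in the imbalanced-learn package, would normally replace these hand-written helpers. Resampling should be applied to the training split only, so that evaluation scores such as F1 are computed on untouched data.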


Source journal

Applied Computing and Informatics Computer Science-Information Systems

CiteScore

12.20

Self-citation rate

0.00%

Number of publications

Review time

39 weeks

About the journal: Applied Computing and Informatics aims to be timely in disseminating leading-edge knowledge to researchers, practitioners and academics whose interest is in the latest developments in applied computing and information systems concepts, strategies, practices, tools and technologies. In particular, the journal encourages research studies that have significant contributions to make to the continuous development and improvement of IT practices in the Kingdom of Saudi Arabia and other countries. By doing so, the journal attempts to bridge the gap between the academic and industrial communities, and therefore welcomes theoretically grounded, methodologically sound research studies that address various IT-related problems and innovations of an applied nature. The journal will serve as a forum for practitioners, researchers, managers and IT policy makers to share their knowledge and experience in the design, development, implementation, management and evaluation of various IT applications. Contributions may deal with, but are not limited to:
• Internet and E-Commerce Architecture, Infrastructure, Models, Deployment Strategies and Methodologies.
• E-Business and E-Government Adoption.
• Mobile Commerce and its Applications.
• Applied Telecommunication Networks.
• Software Engineering Approaches, Methodologies, Techniques, and Tools.
• Applied Data Mining and Warehousing.
• Information Strategic Planning and Resource Management.
• Applied Wireless Computing.
• Enterprise Resource Planning Systems.
• IT Education.
• Societal, Cultural, and Ethical Issues of IT.
• Policy, Legal and Global Issues of IT.
• Enterprise Database Technology.