Creating a New Dataset for the Classification of Cyber Bullying

Advances in Artificial Intelligence Research. Pub Date: 2023-10-29. DOI: 10.54569/aair.1206144

Çilem KOÇAK, Tuncay YİĞİT, Mehmet BİLEN



Abstract

Today's communication technologies, such as phones, tablets, computers and smart devices, have quickly brought people of all ages onto the Internet. As the Internet takes up a larger place in people's lives, social media platforms are diversifying and more users want to take part in them. The growing number of social media users has also brought problems, the most important of which is cyberbullying. Although cyberbullying can look like ordinary dialogue between social media users or groups, it is encountered more often every day as the information, content and topics shared on social media diversify. This makes it necessary to develop platforms that detect bullying with artificial intelligence technologies. One of the biggest difficulties in the text classification problems that arise when developing such platforms is that the artificial intelligence algorithm must be trained with labeled data.

In this study, 21 people who actively use social media, including journalists, athletes, scientists, doctors, politicians, comedians, social media influencers and artists, were selected in order to create the dataset needed to train models that detect cyberbullying. The public messages (mentions) sent to these 21 people via Twitter were compiled. After repetitive and meaningless messages sent by bot accounts were filtered out of the 10,500 tweets compiled, 7,706 messages remained in the dataset. The labeling required for the dataset to be used for training and testing classifiers was carried out by three independent people who had been given preliminary information about cyberbullying (1 = includes cyberbullying, 0 = does not include cyberbullying). The label assigned by the majority of the three annotators was accepted as the final class of each message. The dataset was then preprocessed in accordance with natural language processing principles and made suitable for classification algorithms. The findings obtained with basic classification algorithms are reported. They indicate that the dataset is suitable for use in the detection and prevention of cyberbullying. It is therefore expected that training specially developed and optimized artificial intelligence algorithms on this dataset will greatly increase the success rate of cyberbullying detection.
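
The filtering and majority-vote labeling steps described in the abstract might look roughly like the Python sketch below. It is an illustration only, not the authors' code: the abstract does not say how repeated messages were detected, so the case- and whitespace-insensitive comparison in `drop_repeats` is an assumption, and the function names and sample messages are hypothetical.

```python
from collections import Counter


def majority_label(votes):
    """Return the label chosen by most annotators (1 = cyberbullying, 0 = not)."""
    if len(votes) % 2 == 0:
        # With binary labels, an odd number of annotators rules out ties.
        raise ValueError("an odd number of votes is needed to avoid ties")
    label, _count = Counter(votes).most_common(1)[0]
    return label


def drop_repeats(messages):
    """Keep the first copy of each message.

    Assumption: two messages count as repeats when they match after
    lowercasing and collapsing whitespace. The paper may have used
    a different rule.
    """
    seen = set()
    kept = []
    for text in messages:
        key = " ".join(text.lower().split())
        if key and key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


# Hypothetical example data, not taken from the paper's dataset.
mentions = ["You are pathetic", "you are  pathetic", "Great match today!"]
annotations = {
    "You are pathetic": [1, 1, 0],    # two of three annotators chose 1
    "Great match today!": [0, 0, 0],  # unanimous 0
}

unique_mentions = drop_repeats(mentions)
final_labels = {text: majority_label(annotations[text]) for text in unique_mentions}
```

With three annotators and two classes, every message gets a strict majority, which is probably why an odd number of annotators suits this kind of labeling.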

