MalCov: Covid-19 Fake News Dataset in the Malay Language

2022 International Visualization, Informatics and Technology Conference (IVIT), Pub Date: 2022-11-01, DOI: 10.1109/IVIT55443.2022.10033374

N. H. A. Rahim, M. Basri


Citations: 1

Abstract



The COVID-19 pandemic has drastically changed the world. As the virus spread worldwide, misinformation about COVID-19 spread with it and created chaos in society. Inaccurate use of terminology during the infodemic produced misleading information about the disease, causing panic and confusion among the public and miscommunication between the government and the public. Several recent attempts have used automated classification with machine learning models to curb the spread of this fake news. These methods require labeled data. However, the scarcity of available corpora for predictive modeling, particularly in languages other than English, is a major barrier in this area. Our research may be a first step toward an extensive study of fake news detection in the Malay language. For this purpose, we introduce the MalCov (Malaysia Covid) fake news dataset. About 79.5% of MalCov, approximately 171 statements, consists of fake articles gathered from the main social media platforms. The remaining statements are valid articles that have been checked and manually validated by the local authorities. All of these articles were collected from a single portal, "Sebenarnya.my". Because the dataset is in a non-English language, the data has been separated into contents and titles. The most frequently used words are then analyzed. Several machine learning models, such as Naïve Bayes, SVM, and Logistic Regression, are used to build the classifiers. The decision tree achieves the highest performance, at 93.48%.

Keywords: dataset; fake news; fake news detection; machine learning classification; Malay language.
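The abstract mentions separating each article into a title and contents and then analyzing the most frequent words. The sketch below shows one way such a count could be done with the Python standard library. It is not the authors' code: the record layout, the field names (`title`, `content`, `label`), the helper functions, and the toy Malay strings are all assumptions made for illustration and are not taken from MalCov.

```python
import re
from collections import Counter

# Hypothetical records: field names and text are placeholders,
# not entries from the MalCov dataset.
records = [
    {"title": "vaksin covid bahaya",
     "content": "vaksin ini bahaya kata mesej viral",
     "label": "fake"},
    {"title": "vaksin covid selamat",
     "content": "kementerian sahkan vaksin selamat",
     "label": "valid"},
    {"title": "mesej viral palsu",
     "content": "mesej viral tentang covid adalah palsu",
     "label": "fake"},
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def top_words(records, field, label=None, n=3):
    """Return the n most common tokens in one field.

    If label is given, only records with that label are counted.
    """
    counts = Counter()
    for rec in records:
        if label is None or rec["label"] == label:
            counts.update(tokenize(rec[field]))
    return counts.most_common(n)


if __name__ == "__main__":
    # Titles and contents are counted separately, as in the abstract.
    print(top_words(records, "title", n=2))
    print(top_words(records, "content", label="fake", n=2))
```

Counting titles and contents separately, and optionally per label, makes it possible to compare which words dominate fake versus valid items. A real analysis would probably also need Malay stop-word removal, which this sketch leaves out.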

