Authors: N. H. A. Rahim, M. Basri
Published in: 2022 International Visualization, Informatics and Technology Conference (IVIT), 2022-11-01
DOI: 10.1109/IVIT55443.2022.10033374 (https://doi.org/10.1109/IVIT55443.2022.10033374)
MalCov: Covid-19 Fake News Dataset in the Malay Language
The COVID-19 pandemic has drastically changed the world. As the virus spread worldwide, misinformation about COVID-19 spread with it and created chaos in society. Inaccurate use of terminology during this infodemic produced misleading information about the disease, causing panic and confusion among the public and miscommunication between the government and the public. Several recent attempts have used automated classification with machine learning models to curb the spread of this fake news. These methods require labeled data. However, the scarcity of available corpora for predictive modeling, particularly in languages other than English, is a major barrier in this area. Our research may be the first step toward an extensive study of fake news detection in the Malay language. We introduce the MalCov (Malaysia Covid) fake news dataset for this purpose. Fake articles make up 79.5% of MalCov, approximately 171 statements, and were gathered from the main social media platforms. The remaining statements are valid articles that have been checked and manually validated by the local authorities. All articles were collected from a single portal, "Sebenarnya.my". Since the dataset is in a non-English language, the data has been separated into contents and titles. The most frequent words are then analyzed. Several machine learning models, such as Naïve Bayes, SVM, and Logistic Regression, are used to build the classifiers. Among the models evaluated, the decision tree achieves the highest performance, 93.48%.
Keywords: dataset; fake news; fake news detection; machine learning classification; Malay language.
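The abstract describes splitting each article into a title and a content field and then analyzing the most frequent words. A minimal sketch of that step might look like the following. The records, the field names (`title`, `content`, `label`), and the simple regex tokenizer are all assumptions made for illustration; they are not taken from MalCov, and the authors' actual preprocessing is not described in the abstract.

```python
import re
from collections import Counter

# Invented placeholder records for illustration only; NOT rows from MalCov.
# The field names "title", "content" and "label" are assumptions.
records = [
    {"title": "Vaksin covid palsu dijual",
     "content": "Berita vaksin palsu tular di media sosial",
     "label": "fake"},
    {"title": "Vaksin covid selamat",
     "content": "Kementerian sahkan vaksin selamat digunakan",
     "label": "valid"},
    {"title": "Sekolah ditutup kerana covid",
     "content": "Berita sekolah ditutup adalah palsu",
     "label": "fake"},
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_words(rows, field, n=3):
    """Return the n most frequent tokens found in one field across all rows."""
    counts = Counter()
    for row in rows:
        counts.update(tokenize(row[field]))
    return counts.most_common(n)


# Titles and contents are counted separately, mirroring the split in the abstract.
title_top = top_words(records, "title")
content_top = top_words(records, "content")

print("Most frequent title words:", title_top)
print("Most frequent content words:", content_top)
```

Counting the two fields separately makes it easy to see whether headlines and article bodies rely on different vocabulary. A real analysis would likely also remove Malay stop words, which this sketch omits.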