CoviFake: A Framework to Detect and Analyze Fake COVID19 Tweets

2022 International Conference on Frontiers of Information Technology (FIT). Pub Date: 2022-12-01. DOI: 10.1109/FIT57066.2022.00060

Tooba Asif, Bilal Tahir, Yasir Saleem, M. Mehmood


Abstract

Along with the unprecedented impact of the COVID-19 pandemic on human lives, a new crisis of fake and false information related to the disease has also emerged. Social media platforms such as Twitter are primarily used to disseminate fake information because of their ease of access and large audience. However, automatic detection and classification of fake tweets is a challenging task due to the complexity of short text and its lack of contextual features. This paper proposes a novel CoviFake framework to classify and analyze fake tweets related to COVID-19 using vocabulary and non-vocabulary features. For this purpose, we first combine and enhance the ‘CTF’ and ‘COVID19 Rumor’ datasets to build our COVID19-sham dataset containing 25,388 labelled tweets. Next, we extract vocabulary features and 12 non-vocabulary features to compare the performance of six state-of-the-art machine learning classifiers. Our results highlight that the Random Forest (RF) classifier achieves the highest accuracy of 94.53% with the combination of the top 2,000 vocabulary features and the 12 non-vocabulary features. In addition, we developed a large-scale dataset, CoviTweets, containing 7.88 million English tweets posted by 3.8 million users over two months (March-April 2020). Analysis of CoviTweets using our framework reveals that the dataset contains 1.64 million (20.87%) fake tweets. Furthermore, we perform an in-depth examination by assigning a ‘fakeness score’ to hashtags and users in CoviTweets.
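The idea of combining vocabulary and non-vocabulary features, and of scoring hashtags by how often they appear in fake tweets, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the abstract does not list the 12 non-vocabulary features, so three simple stand-ins are used (length, hashtag count, URL count); the vocabulary is taken as the most frequent tokens; and the 'fakeness score' is assumed here to be the share of tweets carrying a hashtag that are labelled fake. The function names and sample tweets are hypothetical, and the classifier itself (Random Forest in the paper) is left out.

```python
import re
from collections import Counter, defaultdict

TOKEN_RE = re.compile(r"[a-z0-9']+")
HASHTAG_RE = re.compile(r"#\w+")
URL_RE = re.compile(r"https?://\S+")


def tokenize(text):
    """Lowercase a tweet, drop URLs, and split it into word tokens."""
    return TOKEN_RE.findall(URL_RE.sub(" ", text.lower()))


def top_vocabulary(tweets, k):
    """Return the k most frequent tokens across all tweets."""
    counts = Counter(tok for t in tweets for tok in tokenize(t))
    return [tok for tok, _ in counts.most_common(k)]


def non_vocab_features(text):
    """Three illustrative non-vocabulary features (assumed, not the paper's 12)."""
    return [
        len(text),                      # character length
        len(HASHTAG_RE.findall(text)),  # number of hashtags
        len(URL_RE.findall(text)),      # number of URLs
    ]


def featurize(text, vocab):
    """Concatenate per-token counts for the vocabulary with the other features."""
    counts = Counter(tokenize(text))
    return [counts[w] for w in vocab] + non_vocab_features(text)


def hashtag_fakeness(tweets, is_fake):
    """Assumed definition: share of tweets using a hashtag that are labelled fake."""
    total, fake = defaultdict(int), defaultdict(int)
    for text, label in zip(tweets, is_fake):
        # Count each hashtag once per tweet, case-insensitively.
        for tag in {h.lower() for h in HASHTAG_RE.findall(text)}:
            total[tag] += 1
            fake[tag] += int(label)
    return {tag: fake[tag] / total[tag] for tag in total}


# Hypothetical toy data; labels would come from a trained classifier in practice.
tweets = [
    "Garlic cures the virus #covid19 #cure",
    "Wash your hands often #covid19 https://example.org/advice",
    "5G towers spread the virus #covid19 #5g",
]
labels = [True, False, True]

vocab = top_vocabulary(tweets, k=3)
X = [featurize(t, vocab) for t in tweets]
scores = hashtag_fakeness(tweets, labels)
```

Each row of `X` could then be passed to any standard classifier. The same per-hashtag aggregation would apply to users by grouping on a user identifier instead of a hashtag, again under the assumed definition of the score.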
