COVID-19 Misinformation Detection: Machine-Learned Solutions to the Infodemic
Nikhil Kolluri, Yunong Liu, Dhiraj Murthy
JMIR Infodemiology (Q1, Health Care Sciences & Services), published 2022-08-25
DOI: 10.2196/38756
Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9987189/pdf/
Citations: 0
Abstract
Background: The volume of COVID-19-related misinformation has long exceeded the resources available to fact checkers to effectively mitigate its ill effects. Automated, web-based approaches can provide effective deterrents to online misinformation. Machine learning-based methods have achieved robust performance on text classification tasks, including credibility assessment of potentially low-quality news. Despite the progress of initial, rapid interventions, the sheer volume of COVID-19-related misinformation continues to overwhelm fact checkers. Improvement in automated, machine-learned methods for infodemic response is therefore urgently needed.
Objective: The aim of this study was to improve automated, machine-learned methods for infodemic response.
Methods: We evaluated three strategies for training a machine-learning model to determine the highest model performance: (1) COVID-19-related fact-checked data only, (2) general fact-checked data only, and (3) combined COVID-19 and general fact-checked data. We created two COVID-19-related misinformation data sets from fact-checked "false" content combined with programmatically retrieved "true" content. The first set contained ~7000 entries from July to August 2020, and the second contained ~31,000 entries from January 2020 to June 2022. We crowdsourced 31,441 votes to human-label the first data set.
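The three training strategies can be sketched as follows. This is a toy illustration only: the texts, labels, and the bag-of-words classifier below are invented stand-ins (the study fine-tuned pretrained language models on real fact-checked corpora), but the experimental design — train on topic-specific, general, or combined data, then score on a held-out COVID-19 validation set — matches the description above.

```python
from collections import Counter

def train(texts, labels):
    # Per-class bag-of-words counts; a toy stand-in for fine-tuning a model.
    bags = {}
    for text, label in zip(texts, labels):
        bags.setdefault(label, Counter()).update(text.lower().split())
    return bags

def predict(bags, text):
    # Pick the class whose training vocabulary overlaps the text most.
    words = text.lower().split()
    return max(bags, key=lambda lbl: sum(bags[lbl][w] for w in words))

def accuracy(bags, texts, labels):
    return sum(predict(bags, t) == y for t, y in zip(texts, labels)) / len(labels)

# Invented miniature corpora; label 0 = "false" claim, 1 = "true" claim.
covid_texts = ["vaccines contain microchips", "masks reduce viral transmission",
               "5g towers spread the virus", "hand washing prevents infection"]
covid_labels = [0, 1, 0, 1]
general_texts = ["the moon landing was faked", "water boils at lower temperatures at altitude",
                 "vitamin megadoses cure cancer", "smoking increases lung cancer risk"]
general_labels = [0, 1, 0, 1]

# Held-out COVID-19 validation items, mirroring the external validation sets.
val_texts = ["microchips in vaccines", "masks reduce transmission"]
val_labels = [0, 1]

strategies = {
    "covid_only": accuracy(train(covid_texts, covid_labels), val_texts, val_labels),
    "general_only": accuracy(train(general_texts, general_labels), val_texts, val_labels),
    "combined": accuracy(train(covid_texts + general_texts,
                               covid_labels + general_labels), val_texts, val_labels),
}
```

On this toy data the topic-specific and combined strategies score perfectly while the general-only strategy does not, loosely mirroring the study's finding that topic-specific training helped; the real accuracies, of course, come from the actual data sets.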
Results: The models achieved accuracies of 96.55% and 94.56% on the first and second external validation data sets, respectively. Our best-performing model was developed using COVID-19-specific content. We successfully developed combined models that outperformed human votes at identifying misinformation. Specifically, when we blended our model predictions with human votes, the highest accuracy we achieved on the first external validation data set was 99.1%. When we considered only outputs on which the machine-learning model agreed with human votes, we achieved accuracies of up to 98.59% on the first validation data set. This outperformed human votes alone, which reached an accuracy of only 73%.
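The two ways of combining model output with crowdsourced votes described above — blending the scores, and keeping only the "high-confidence" subset where model and crowd agree — can be sketched as below. All probabilities, vote fractions, and the 50/50 blend weight are invented for illustration; the study's actual combination scheme may differ.

```python
# Per item: model P(true), fraction of crowd votes for "true", and ground truth.
model_probs = [0.9, 0.2, 0.8, 0.6, 0.1]   # hypothetical model outputs
vote_fracs  = [1.0, 0.0, 0.4, 0.8, 0.2]   # hypothetical crowd vote shares
truth       = [1, 0, 1, 1, 0]

# (1) Blend: average the model probability with the crowd vote share,
# then threshold at 0.5 to get a label.
blended = [0.5 * m + 0.5 * v for m, v in zip(model_probs, vote_fracs)]
blend_preds = [int(b >= 0.5) for b in blended]

# (2) High-confidence subset: keep only items where the model's label
# and the crowd-majority label agree, and score accuracy on that subset.
agree = [i for i, (m, v) in enumerate(zip(model_probs, vote_fracs))
         if (m >= 0.5) == (v >= 0.5)]
subset_acc = sum(int(model_probs[i] >= 0.5) == truth[i] for i in agree) / len(agree)
```

The agreement filter trades coverage for accuracy: items where the two sources disagree (index 2 here) are dropped rather than labeled, which is one plausible way a combined system can exceed either source alone on the items it does label.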
Conclusions: External validation accuracies of 96.55% and 94.56% are evidence that machine learning can produce superior results for the difficult task of classifying the veracity of COVID-19 content. Pretrained language models performed best when fine-tuned on a topic-specific data set, while other models achieved their best accuracy when fine-tuned on a combination of topic-specific and general-topic data sets. Crucially, blended models, trained/fine-tuned on general-topic content with crowdsourced data, improved our models' accuracies to as high as 99.7%. The successful use of crowdsourced data can increase model accuracy in situations where expert-labeled data are scarce. The 98.59% accuracy on a "high-confidence" subset comprising machine-learned and human labels suggests that crowdsourced votes can refine machine-learned labels, improving accuracy above human-only levels. These results support the utility of supervised machine learning to deter and combat future health-related disinformation.