Data Exploration and Classification of News Article Reliability: Deep Learning Study.

Impact factor: 3.5 | JCR Q1, Health Care Sciences & Services

JMIR Infodemiology | Pub Date: 2022-09-22 | eCollection Date: 2022-07-01 | DOI: 10.2196/38839

Kevin Zhan, Yutong Li, Rafay Osmani, Xiaoyu Wang, Bo Cao


Citations: 0

Abstract

Background: During the ongoing COVID-19 pandemic, we are exposed to large amounts of information each day. The World Health Organization defines this "infodemic" as the mass spread of misleading or false information during a pandemic. The spread of misinformation ultimately leads to misunderstanding of public health orders or direct opposition to public policies. Although there have been efforts to combat the spread of misinformation, current manual fact-checking methods are insufficient to combat the infodemic.

Objective: We propose the use of natural language processing (NLP) and machine learning (ML) techniques to build a model that can be used to identify unreliable news articles online.

Methods: First, we preprocessed the ReCOVery data set to obtain 2029 English news articles tagged with COVID-19 keywords from January to May 2020, each labeled as reliable or unreliable. Data exploration was conducted to determine major differences between reliable and unreliable articles. We built an ensemble deep learning model that uses the body text together with features such as sentiment, Empath-derived lexical categories, and readability to classify article reliability.
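As an illustration of what a readability feature can look like, the following is a minimal sketch only. The abstract does not say which readability score the authors used, so the Flesch Reading Ease formula is an assumption here, and the vowel-group syllable counter is a rough heuristic. The function names `count_syllables`, `flesch_reading_ease`, and `readability_features` are hypothetical and do not come from the paper.

```python
import re


def count_syllables(word: str) -> int:
    """Rough heuristic (assumption): one syllable per group of consecutive vowels."""
    groups = re.findall(r"[aeiouy]+", word.lower())
    return max(1, len(groups))


def flesch_reading_ease(text: str) -> float:
    """Flesch Reading Ease; higher scores mean easier text.

    Assumed metric for illustration: the source only says "readability".
    """
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    if not sentences or not words:
        raise ValueError("text must contain at least one sentence and one word")
    syllables = sum(count_syllables(w) for w in words)
    words_per_sentence = len(words) / len(sentences)
    syllables_per_word = syllables / len(words)
    return 206.835 - 1.015 * words_per_sentence - 84.6 * syllables_per_word


def readability_features(text: str) -> dict:
    """Bundle a few simple per-article features (hypothetical feature set)."""
    words = re.findall(r"[A-Za-z']+", text)
    return {
        "flesch_reading_ease": flesch_reading_ease(text),
        "word_count": len(words),
        "avg_word_length": sum(len(w) for w in words) / len(words),
    }


if __name__ == "__main__":
    easy = "The cat sat on the mat. The dog ran fast."
    hard = "Epidemiological surveillance necessitates comprehensive interdisciplinary collaboration."
    print(readability_features(easy))
    print(readability_features(hard))
```

Scores like these could be concatenated with sentiment and lexical-category features and passed to a classifier alongside the body text. How the authors actually combined them is not described in the abstract.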

Results: We found that reliable news articles have a higher proportion of neutral sentiment, while unreliable articles have a higher proportion of negative sentiment. Our analysis also showed that reliable articles are easier to read than unreliable articles and that the two groups differ in lexical categories and keywords. Our new model achieved an area under the curve (AUC) of 0.906, a specificity of 0.835, and a sensitivity of 0.945. These values are above the baseline performance of the original ReCOVery model.
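For readers less familiar with the reported metrics, here is a minimal sketch of how sensitivity, specificity, and AUC are computed from labels and model scores. It is not the authors' evaluation code. The labels and scores are toy values, the choice of which class is coded as positive (1) is an assumption because the abstract does not state it, and the function names are hypothetical.

```python
def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one positive and one negative example")
    return tp / (tp + fn), tn / (tn + fp)


def roc_auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney formulation of AUC).
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative example")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


if __name__ == "__main__":
    # Toy data (not from the paper); 1 is the assumed positive class.
    y_true = [1, 1, 1, 0, 0]
    scores = [0.9, 0.8, 0.4, 0.5, 0.1]
    y_pred = [1 if s >= 0.5 else 0 for s in scores]  # 0.5 threshold is an assumption
    sens, spec = sensitivity_specificity(y_true, y_pred)
    print(f"sensitivity={sens:.3f} specificity={spec:.3f} auc={roc_auc(y_true, scores):.3f}")
```

Sensitivity and specificity depend on the decision threshold, whereas AUC summarizes ranking quality across all thresholds, which is why the three numbers are reported together.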

Conclusions: This paper identified novel differences between reliable and unreliable news articles; moreover, the model was trained using state-of-the-art deep learning techniques. We aim to use our findings to help researchers and the public more easily identify false information and unreliable media in their everyday lives.


Source journal

JMIR infodemiology

CiteScore

4.80

Self-citation rate

0.00%

Articles published