Developing a hierarchical model for unraveling conspiracy theories

IF 2.5 | Zone 2 (Computer Science) | Q1 MATHEMATICS, INTERDISCIPLINARY APPLICATIONS

EPJ Data Science | Pub Date: 2024-04-16 | DOI: 10.1140/epjds/s13688-024-00470-5

Mohsen Ghasemizade, Jeremiah Onaolapo

Citations: 0

Abstract

A conspiracy theory (CT) suggests that covert groups or powerful individuals secretly manipulate events. Not knowing about existing conspiracy theories could make one more likely to believe them, so this work aims to compile a tree-shaped list of CTs that is as comprehensive as possible. We began with a manually curated ‘tree’ of CTs drawn from academic papers and Wikipedia. Next, we examined 1769 CT-related articles from four fact-checking websites, focusing on their core content, and used keyphrase extraction to label the documents. This process yielded 769 identified conspiracies, each assigned a label and a family name. The second goal of this project was to detect whether an article is a conspiracy theory, so we built a binary classifier with our labeled dataset. The classifier is built on RoBERTa, a transformer-based language model pre-trained on a large corpus, and achieves an F1 score of 87%. It helps to identify potential conspiracy theories in new articles. To assign a label from the tree to new articles detected as conspiracy theories, we grouped them using a combination of a dimension reduction technique (UMAP) and clustering (HDBSCAN), then labeled the resulting groups so that they could be matched to the tree. This lets us detect new conspiracy theories and expand the tree using computational methods. We successfully generated a tree of conspiracy theories and built a pipeline to detect and categorize conspiracy theories within any text corpus. The pipeline can extract such insights from any database formatted as text.
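
As a rough illustration of two ideas in the abstract, the sketch below shows one possible way to store conspiracies under a family name and a label, and how an F1 score is computed from classification counts. It is not the authors' implementation: the class `CTNode`, the function `f1_score`, the example families and labels, and the counts passed to `f1_score` are all hypothetical and chosen only for demonstration. The RoBERTa, UMAP, and HDBSCAN stages are deliberately left out.

```python
from dataclasses import dataclass, field


@dataclass
class CTNode:
    """One node of a conspiracy-theory tree (hypothetical structure)."""
    name: str
    children: dict[str, "CTNode"] = field(default_factory=dict)

    def add(self, family: str, label: str) -> "CTNode":
        """Insert `label` under `family`, creating the family if needed."""
        family_node = self.children.setdefault(family, CTNode(family))
        return family_node.children.setdefault(label, CTNode(label))

    def leaves(self) -> list[tuple[str, str]]:
        """Return every (family, label) pair, assuming a two-level tree."""
        return [
            (family.name, leaf.name)
            for family in self.children.values()
            for leaf in family.children.values()
        ]


def f1_score(tp: int, fp: int, fn: int) -> float:
    """F1 is the harmonic mean of precision and recall."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


# Hypothetical entries, not taken from the paper's tree.
tree = CTNode("conspiracy theories")
tree.add("health", "vaccine microchips")
tree.add("health", "5G causes illness")
tree.add("space", "moon landing hoax")
tree.add("health", "vaccine microchips")  # re-adding an existing label is a no-op

print(tree.leaves())

# Hypothetical counts: 8 true positives, 2 false positives, 1 false negative.
print(f1_score(tp=8, fp=2, fn=1))
```

Keying children by name means a conspiracy detected a second time lands on its existing node instead of creating a duplicate, which is one simple way to let the tree grow as new articles are categorized.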

Source journal

EPJ Data Science (MATHEMATICS, INTERDISCIPLINARY APPLICATIONS)

CiteScore: 6.10

Self-citation rate: 5.60%

Review time: 13 weeks

Journal description: EPJ Data Science covers a broad range of research areas and applications and particularly encourages contributions from techno-socio-economic systems, where it comprises those research lines that now regard the digital “tracks” of human beings as first-order objects for scientific investigation. Topics include, but are not limited to, human behavior, social interaction (including animal societies), economic and financial systems, management and business networks, socio-technical infrastructure, health and environmental systems, the science of science, as well as general risk and crisis scenario forecasting up to and including policy advice.