Towards universal methods for fake news detection

IF 2.3 3区计算机科学 Q3 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE

Natural Language Engineering Pub Date : 2022-10-26 DOI:10.1017/s1351324922000456

M. Pszona, M. Janicka, Grzegorz Wojdyga, A. Wawer

{"title":"Towards universal methods for fake news detection","authors":"M. Pszona, M. Janicka, Grzegorz Wojdyga, A. Wawer","doi":"10.1017/s1351324922000456","DOIUrl":null,"url":null,"abstract":"Abstract Fake news detection is an emerging topic that has attracted a lot of attention among researchers and in the industry. This paper focuses on fake news detection as a text classification problem: on the basis of five publicly available corpora with documents labeled as true or fake, the task was to automatically distinguish both classes without relying on fact-checking. The aim of our research was to test the feasibility of a universal model: one that produces satisfactory results on all data sets tested in our article. We attempted to do so by training a set of classification models on one collection and testing them on another. As it turned out, this resulted in a sharp performance degradation. Therefore, this paper focuses on finding the most effective approach to utilizing information in a transferable manner. We examined a variety of methods: feature selection, machine learning approaches to data set shift (instance re-weighting and projection-based), and deep learning approaches based on domain transfer. These methods were applied to various feature spaces: linguistic and psycholinguistic, embeddings obtained from the Universal Sentence Encoder, and GloVe embeddings. A detailed analysis showed that some combinations of these methods and selected feature spaces bring significant improvements. When using linguistic data, feature selection yielded the best overall mean improvement (across all train-test pairs) of 4%. Among the domain adaptation methods, the greatest improvement of 3% was achieved by subspace alignment.","PeriodicalId":49143,"journal":{"name":"Natural Language Engineering","volume":"29 1","pages":"1004 - 1042"},"PeriodicalIF":2.3000,"publicationDate":"2022-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Natural Language Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1017/s1351324922000456","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}

引用次数: 0

Abstract

Abstract Fake news detection is an emerging topic that has attracted a lot of attention among researchers and in the industry. This paper focuses on fake news detection as a text classification problem: on the basis of five publicly available corpora with documents labeled as true or fake, the task was to automatically distinguish both classes without relying on fact-checking. The aim of our research was to test the feasibility of a universal model: one that produces satisfactory results on all data sets tested in our article. We attempted to do so by training a set of classification models on one collection and testing them on another. As it turned out, this resulted in a sharp performance degradation. Therefore, this paper focuses on finding the most effective approach to utilizing information in a transferable manner. We examined a variety of methods: feature selection, machine learning approaches to data set shift (instance re-weighting and projection-based), and deep learning approaches based on domain transfer. These methods were applied to various feature spaces: linguistic and psycholinguistic, embeddings obtained from the Universal Sentence Encoder, and GloVe embeddings. A detailed analysis showed that some combinations of these methods and selected feature spaces bring significant improvements. When using linguistic data, feature selection yielded the best overall mean improvement (across all train-test pairs) of 4%. Among the domain adaptation methods, the greatest improvement of 3% was achieved by subspace alignment.

查看原文本刊更多论文

走向通用的假新闻检测方法

摘要假新闻检测是一个新兴的话题，引起了研究者和业界的广泛关注。本文将假新闻检测作为一个文本分类问题:在五个公开可用的语料库的基础上，将文档标记为真实或虚假，任务是在不依赖事实检查的情况下自动区分这两个类别。我们研究的目的是测试通用模型的可行性:在我们文章中测试的所有数据集上产生令人满意的结果。我们试图通过在一个集合上训练一组分类模型并在另一个集合上测试它们来做到这一点。事实证明，这导致了性能的急剧下降。因此，本文的重点是寻找以可转移的方式利用信息的最有效方法。我们研究了各种方法:特征选择，数据集移位的机器学习方法(实例重新加权和基于投影的)，以及基于域转移的深度学习方法。这些方法被应用于不同的特征空间:语言和心理语言，从通用句子编码器获得的嵌入，以及GloVe嵌入。详细的分析表明，这些方法和选择的特征空间的某些组合带来了显著的改进。当使用语言数据时，特征选择产生了4%的最佳总体平均改进(在所有训练-测试对中)。在领域自适应方法中，子空间对齐方法的改进幅度最大，达到3%。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

Natural Language Engineering COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE-

CiteScore

5.90

自引率

12.00%

发文量

审稿时长

>12 weeks

期刊介绍： Natural Language Engineering meets the needs of professionals and researchers working in all areas of computerised language processing, whether from the perspective of theoretical or descriptive linguistics, lexicology, computer science or engineering. Its aim is to bridge the gap between traditional computational linguistics research and the implementation of practical applications with potential real-world use. As well as publishing research articles on a broad range of topics - from text analysis, machine translation, information retrieval and speech analysis and generation to integrated systems and multi modal interfaces - it also publishes special issues on specific areas and technologies within these topics, an industry watch column and book reviews.