Deception Detection Within and Across Domains: Identifying and Understanding the Performance Gap

Impact factor 2.9; JCR Q3; Computer Science, Information Systems

ACM Journal of Data and Information Quality; Pub Date: 2022-11-22; DOI: 10.1145/3561413

Subhadarshi Panda, Sarah Ita Levitan


Citations: 0

Abstract

NLP approaches to automatic deception detection have gained popularity over the past few years, especially with the proliferation of fake reviews and fake news online. However, most previous studies of deception detection have focused on single domains. We currently lack information about how these single-domain models of deception may or may not generalize to new domains. In this work, we conduct empirical studies of cross-domain deception detection in five domains to understand how current models perform when evaluated on new deception domains. Our experimental results reveal a large gap between within and across domain classification performance. Motivated by these findings, we propose methods to understand the differences in performances across domains. We formulate five distance metrics that quantify the distance between pairs of deception domains. We experimentally demonstrate that the distance between a pair of domains negatively correlates with the cross-domain accuracies of the domains. We thoroughly analyze the differences in the domains and the impact of fine-tuning BERT based models by visualization of the sentence embeddings. Finally, we utilize the distance metrics to recommend the optimal source domain for any given target domain. This work highlights the need to develop robust learning algorithms for cross-domain deception detection that generalize and adapt to new domains and contributes toward that goal.
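The abstract does not say which five distance metrics the authors use, so the following is only a minimal sketch of the general idea of picking the source domain closest to a target domain. It assumes one hypothetical distance, the Jensen-Shannon divergence between unigram distributions, and the function names `unigram_distribution`, `js_divergence`, and `recommend_source` are illustrative, not taken from the paper.

```python
import math
from collections import Counter


def unigram_distribution(texts):
    """Relative frequency of each lowercased whitespace token in a corpus."""
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two token distributions.

    This is an assumed stand-in for a domain distance, not the paper's metric.
    The value lies in [0, 1]; 0 means identical distributions.
    """
    vocab = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the joint vocabulary.
    m = {tok: 0.5 * (p.get(tok, 0.0) + q.get(tok, 0.0)) for tok in vocab}

    def kl(a):
        # KL(a || m); m[tok] > 0 whenever a[tok] > 0, so the log is defined.
        return sum(pa * math.log2(pa / m[tok]) for tok, pa in a.items() if pa > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def recommend_source(target, domains):
    """Return (closest source domain name, all distances) for a target domain.

    `domains` maps a domain name to a list of raw text samples.
    """
    target_dist = unigram_distribution(domains[target])
    distances = {
        name: js_divergence(unigram_distribution(texts), target_dist)
        for name, texts in domains.items()
        if name != target
    }
    return min(distances, key=distances.get), distances


if __name__ == "__main__":
    # Toy, made-up corpora purely for illustration; not data from the paper.
    toy_domains = {
        "hotel_reviews": ["the room was clean and the staff was friendly"],
        "restaurant_reviews": ["the food was great and the staff was friendly"],
        "news": ["senate passes new budget bill after long debate"],
    }
    best, distances = recommend_source("hotel_reviews", toy_domains)
    print(best, distances)
```

Under this assumption, a smaller divergence marks a candidate source domain as lexically closer to the target, which is the sense in which the abstract's negative correlation between distance and cross-domain accuracy would favor it. Any other pairwise distance could be swapped in for `js_divergence` without changing `recommend_source`.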


Source journal

ACM Journal of Data and Information Quality (Computer Science, Information Systems)

CiteScore: 4.10

Self-citation rate: 4.80%

Articles published