iSRD: Spam review detection with imbalanced data distributions

Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014). Pub Date: 2014-08-01. DOI: 10.1109/IRI.2014.7051938

Hamzah Al Najada, Xingquan Zhu

Citations: 37

Abstract

The Internet plays an essential role in modern information systems. Applications such as e-commerce websites have made it common for people to purchase many types of products online. When shopping online, users often rely on reviews from previous customers to make their final decision. Because online reviews play an essential role in selling products (or services) online, some vendors (or customers) post fake or spam reviews to mislead customers. False product reviews may result in unfair market competition and financial loss for customers or vendors. In this research, we aim to distinguish spam from non-spam reviews using supervised classification methods. A challenge in training a classifier for this task is that spam reviews make up only a very small portion of online reviews. This naturally leads to a data imbalance problem, where learning methods that do not emphasize minority samples (i.e., spam reviews) may detect spam poorly even though their overall accuracy might be relatively high. To tackle this challenge, we employ a bagging-based approach to build a number of balanced datasets, train a spam classifier on each, and use the resulting ensemble to detect review spam. Experiments and comparisons demonstrate that our method, iSRD, outperforms baseline methods for review spam detection.
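The bagging idea in the abstract can be illustrated with a minimal sketch. The abstract does not state the base classifier, the number of bags, or the exact sampling scheme, so everything below is an assumption made for illustration: each bag draws an equal number of spam and non-spam examples with replacement, the base learner is a simple nearest-centroid rule, and the ensemble decides by majority vote. The function names are hypothetical and are not taken from the paper.

```python
import random


def balanced_bags(non_spam, spam, n_bags, rng):
    """Yield n_bags balanced training sets (assumed scheme, not the paper's).

    Each bag holds len(spam) bootstrap samples from each class, so the
    majority class is undersampled to the size of the minority class.
    """
    k = len(spam)
    for _ in range(n_bags):
        bag_non_spam = [rng.choice(non_spam) for _ in range(k)]
        bag_spam = [rng.choice(spam) for _ in range(k)]
        yield bag_non_spam, bag_spam


def centroid(rows):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def sq_dist(a, b):
    """Squared Euclidean distance between two feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def train_centroid_classifier(non_spam, spam):
    """Stand-in base learner: label 1 (spam) if closer to the spam centroid."""
    c_non_spam, c_spam = centroid(non_spam), centroid(spam)

    def predict(x):
        return 1 if sq_dist(x, c_spam) < sq_dist(x, c_non_spam) else 0

    return predict


def train_ensemble(non_spam, spam, n_bags=11, seed=0):
    """Train one base classifier per balanced bag."""
    rng = random.Random(seed)
    return [
        train_centroid_classifier(bag_non_spam, bag_spam)
        for bag_non_spam, bag_spam in balanced_bags(non_spam, spam, n_bags, rng)
    ]


def ensemble_predict(models, x):
    """Majority vote over the base classifiers; 1 means spam."""
    votes = sum(model(x) for model in models)
    return 1 if 2 * votes > len(models) else 0


# Toy imbalanced data (synthetic, for illustration only):
# 200 non-spam points near (0, 0) and 10 spam points near (3, 3).
data_rng = random.Random(42)
toy_non_spam = [[data_rng.gauss(0, 0.5), data_rng.gauss(0, 0.5)] for _ in range(200)]
toy_spam = [[data_rng.gauss(3, 0.5), data_rng.gauss(3, 0.5)] for _ in range(10)]

toy_models = train_ensemble(toy_non_spam, toy_spam)
print(ensemble_predict(toy_models, [3.0, 3.0]))
print(ensemble_predict(toy_models, [0.0, 0.0]))
```

Because every bag is balanced, no single base classifier is trained on data dominated by the non-spam class. An odd number of bags avoids tied votes.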

Source journal

Proceedings of the 2014 IEEE 15th International Conference on Information Reuse and Integration (IEEE IRI 2014)

Self-citation rate

0.00%

Publication count