A Cross-Checking Based Method for Fraudulent Detection on E-Commercial Crawling Data

T. K. Dang, Duc Dan Ho, D. Pham, An Khuong Vo, H. Nguyen
{"title":"A Cross-Checking Based Method for Fraudulent Detection on E-Commercial Crawling Data","authors":"T. K. Dang, Duc Dan Ho, D. Pham, An Khuong Vo, H. Nguyen","doi":"10.1109/ACOMP.2016.015","DOIUrl":null,"url":null,"abstract":"Marketing research through collecting data from e-commercial websites comes with latent risks of receiving inaccurate data which have been modified before they are returned, especially when the crawling processes are conducted by other service providers. The risk of data being modified is often dismissed in related research works of web crawling systems. Avoiding this problem requires an examination phase where the data are collected for the second time for comparisons. However, the cost for re-crawling processes to simply examine all the data is significant as it will double the original cost. In this paper, we introduce an efficient approach to choose potential data which are most likely to have been modified for later re-crawling processes. By this approach, we can reduce the cost for examining, but still guarantee the data achieve their authenticity. We then measure the efficiency of our scheme while testing the ability to detect fraudulent data in a dataset containing simulated modified data. Results show that our scheme can reduce considerably the amount of data to be re-crawled but still cover most of the fraudulent data. As an example, by applying our scheme to select the data to be re-crawled from a real-world e-commercial website, with a set in which fraudulent data occupy 50 percentages, we only need to re-collect 50 percentages of total data to detect up to 80 percentages of fraudulent data, which is clearly more efficient than choosing randomly the same amount of data to be re-crawled. We conclude by discussing the accuracy measurement of the proposed model.","PeriodicalId":133451,"journal":{"name":"2016 International Conference on Advanced Computing and Applications (ACOMP)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 International Conference on Advanced Computing and Applications (ACOMP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ACOMP.2016.015","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2

Abstract

Marketing research that collects data from e-commerce websites carries a latent risk of receiving inaccurate data that were modified before being returned, especially when the crawling is carried out by third-party service providers. This risk of data modification is often dismissed in related work on web crawling systems. Avoiding the problem requires an examination phase in which the data are collected a second time for comparison. However, simply re-crawling all the data for examination is costly, as it doubles the original crawling cost. In this paper, we introduce an efficient approach that selects the data most likely to have been modified as candidates for later re-crawling. This reduces the examination cost while still safeguarding the authenticity of the data. We then measure the efficiency of our scheme by testing its ability to detect fraudulent records in a dataset containing simulated modified data. Results show that our scheme considerably reduces the amount of data to be re-crawled while still covering most of the fraudulent data. As an example, applying our scheme to select the data to be re-crawled from a real-world e-commerce website, on a set in which fraudulent records make up 50 percent, re-collecting only 50 percent of the total data detects up to 80 percent of the fraudulent records, which is clearly more efficient than re-crawling a randomly chosen sample of the same size. We conclude by discussing the accuracy measurement of the proposed model.
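The abstract describes the scheme only at a high level: score each crawled record by how likely it is to have been tampered with, re-crawl only the top-ranked fraction within a fixed budget, and flag records whose re-crawled values disagree with the originally delivered ones. The following Python sketch illustrates that flow under stated assumptions; the suspicion heuristic (a price z-score within each product category) and all names (`suspicion_scores`, `select_for_recrawl`, `cross_check`, `budget_ratio`) are hypothetical, since the abstract does not specify the paper's actual selection criterion.

```python
import statistics

def suspicion_scores(records):
    """Score each crawled record's likelihood of having been modified.
    Hypothetical heuristic (not specified in the abstract): the z-score
    of a record's price against other prices in its product category."""
    prices_by_category = {}
    for r in records:
        prices_by_category.setdefault(r["category"], []).append(r["price"])
    cat_stats = {
        c: (statistics.mean(ps), statistics.pstdev(ps) or 1.0)  # guard zero stdev
        for c, ps in prices_by_category.items()
    }
    return {
        r["id"]: abs(r["price"] - cat_stats[r["category"]][0]) / cat_stats[r["category"]][1]
        for r in records
    }

def select_for_recrawl(records, budget_ratio=0.5):
    """Rank records by suspicion and keep only the top fraction for
    re-crawling, instead of re-crawling everything (which doubles cost)."""
    scores = suspicion_scores(records)
    ranked = sorted(records, key=lambda r: scores[r["id"]], reverse=True)
    return ranked[: int(len(ranked) * budget_ratio)]

def cross_check(selected, recrawl):
    """Fetch the selected records a second time (recrawl is a callable
    hitting the source site) and flag any record whose delivered value
    disagrees with the freshly crawled one."""
    return [r for r in selected if recrawl(r["id"])["price"] != r["price"]]
```

The ranking step is what makes a fixed re-crawl budget outperform random sampling in the paper's reported setting: if suspicious records are concentrated near the top of the ranking, a 50 percent budget can cover well over 50 percent of the fraudulent records, whereas a random 50 percent sample catches only about half of them in expectation.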