Detection of near-duplicate user generated contents: the SMS spam collection

SMUC '11 Pub Date : 2011-10-28 DOI:10.1145/2065023.2065031
Enrique Vallés, Paolo Rosso
Citations: 22

Abstract

The number of spam text messages has grown, mainly because companies are looking for free advertising. For users, it is very important to filter these spam messages, which can be viewed as near-duplicate texts because they are mostly created from templates. Identifying spam text messages is a hard and time-consuming task that involves carefully scanning hundreds of text messages. Since near-duplicate detection can be seen as a specific case of plagiarism detection, we investigated whether plagiarism detection tools could be used as filters for spam text messages. Moreover, we address the near-duplicate detection problem with a clustering approach using the CLUTO framework. We carried out preliminary experiments on the SMS Spam Collection, which was recently made available for research purposes, and compared the results of the plagiarism detection tools with those obtained with CLUTO. Although plagiarism detection tools detect a good number of near-duplicate SMS spam messages, even better results are obtained with the CLUTO clustering tool.
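To make the notion of template-based near-duplicates concrete, the following is a minimal sketch of one common way to measure it: Jaccard similarity over word trigrams, followed by a greedy grouping pass. It is not the method used in the paper, does not call CLUTO or any plagiarism detection tool, and the function names, the 0.5 threshold, and the sample messages are all assumptions chosen for illustration.

```python
import re


def shingles(text, n=3):
    """Return the set of word n-grams of a lowercased message (punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if len(words) < n:
        # Short messages are represented by a single shingle of all their words.
        return {tuple(words)} if words else set()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets; defined as 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


def group_near_duplicates(messages, threshold=0.5, n=3):
    """Greedy grouping: a message joins the first group whose first member
    is at least `threshold` similar; otherwise it starts a new group.
    The threshold is an illustrative assumption, not a value from the paper."""
    groups = []  # list of (shingle set of first member, [message indices])
    for idx, msg in enumerate(messages):
        sh = shingles(msg, n)
        for representative, members in groups:
            if jaccard(sh, representative) >= threshold:
                members.append(idx)
                break
        else:
            groups.append((sh, [idx]))
    return [members for _, members in groups]


# Hypothetical messages: the first two differ only in the phone number.
sample = [
    "You have won a free prize call 0800 now to claim your reward",
    "You have won a free prize call 0900 now to claim your reward",
    "Are we still meeting for lunch tomorrow",
]
print(group_near_duplicates(sample))
```

In this sketch the two templated messages share 8 of 14 distinct trigrams (similarity of about 0.57), so they fall into one group, while the unrelated message forms its own. A real clustering tool would compare all pairs or optimize a global criterion rather than rely on a single greedy pass.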