A splog filtering method based on string copy detection

T. Takeda, A. Takasu
Published in: 2008 First International Conference on the Applications of Digital Information and Web Technologies (ICADIWT)
Publication date: 2008-10-31
DOI: 10.1109/ICADIWT.2008.4664407 (https://doi.org/10.1109/ICADIWT.2008.4664407)
Citations: 4

Abstract

Many people now publish blogs, and the blogosphere has become an important information source. It is used for purposes such as trend and reputation analysis and marketing. One problem in the blogosphere is spam, as with e-mail and web links. Many spam blogs (splogs) are generated to drive users to specific sites. This paper proposes a splog filtering method. Splogs are usually generated automatically by copying words and phrases from other documents. The proposed method therefore detects strings that appear in multiple blogs and uses the copy rate of strings as a key feature for splog filtering. To evaluate the method, we constructed an evaluation corpus by randomly gathering blogs over a fixed period and manually judging whether each blog is a splog. Experiments on this corpus reveal several characteristics of splog filtering by copied-string detection. The method uses a suffix array to detect copied substrings and can judge each blog in O(m^2 log n) time, where n is the total length of the documents used for copy detection and m is the length of the blog to be judged.
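The copied-substring lookup described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the abstract does not define the copy rate precisely, so the sketch assumes it is the fraction of a blog's characters covered by substrings of at least `min_len` characters that also occur in the reference documents. The function names, the `min_len` threshold, and the naive sort-based suffix array construction are all assumptions made for illustration.

```python
def build_suffix_array(text):
    # Naive construction: sort suffix start positions by the suffix itself.
    # Adequate for a small demo; real corpora need a faster algorithm.
    return sorted(range(len(text)), key=lambda i: text[i:])


def occurs(text, sa, pattern):
    # Binary search over the suffix array for a suffix that starts with pattern.
    lo, hi = 0, len(sa)
    while lo < hi:
        mid = (lo + hi) // 2
        if text[sa[mid]:sa[mid] + len(pattern)] < pattern:
            lo = mid + 1
        else:
            hi = mid
    return lo < len(sa) and text.startswith(pattern, sa[lo])


def copy_rate(blog, corpus_docs, min_len=8):
    # Assumed definition: share of blog characters that lie inside a substring
    # of length >= min_len which also appears in some reference document.
    # Documents are joined with "\x00", assumed absent from the blog text,
    # so that no match can span two documents.
    text = "\x00".join(corpus_docs)
    sa = build_suffix_array(text)
    covered = [False] * len(blog)
    for i in range(len(blog) - min_len + 1):
        length = min_len
        if not occurs(text, sa, blog[i:i + length]):
            continue
        # Extend the match one character at a time while it still occurs.
        while i + length < len(blog) and occurs(text, sa, blog[i:i + length + 1]):
            length += 1
        for j in range(i, i + length):
            covered[j] = True
    return sum(covered) / len(blog) if blog else 0.0


if __name__ == "__main__":
    docs = ["buy cheap watches online today", "my cat sat on the mat"]
    print(copy_rate("buy cheap watches!!!", docs))
    print(copy_rate("an entirely original post", docs))
```

Each start position in the blog triggers up to m binary searches over the suffix array, which is where a bound of the form m^2 log n comes from if string comparison cost is ignored. A classifier could then threshold the resulting rate or combine it with other features; the abstract does not say which.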