A splog filtering method based on string copy detection

2008 First International Conference on the Applications of Digital Information and Web Technologies (ICADIWT). Pub Date: 2008-10-31. DOI: 10.1109/ICADIWT.2008.4664407

T. Takeda, A. Takasu


Citations: 4

Abstract

Many people now publish blogs, and the blogosphere has become an important information source. It is used for purposes such as trend and reputation analysis and marketing. One problem with the blogosphere is spam, as with e-mail and web links. Many spam blogs (splogs) are generated to drive users to specific sites. This paper proposes a splog filtering method. Splogs are usually generated automatically by copying words and phrases from other documents. The proposed method therefore detects strings that appear in multiple blogs and uses the copy rate of strings as a key feature for splog filtering. To evaluate the method, we constructed an evaluation corpus by randomly gathering blogs over a certain period of time and manually judging whether each blog was a splog. Experiments on this corpus reveal several characteristics of splog filtering by copied-string detection. The method uses a suffix array to detect copied substrings and can judge each blog with time complexity O(m^2 log n), where n and m denote the total length of the documents used for copy detection and the length of the blog to be judged, respectively.
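The abstract gives no implementation details, so the following is only a minimal sketch of how a suffix-array-based copy rate could be computed; it is not the authors' algorithm. The function names, the `min_len` threshold, and the definition of copy rate as the fraction of blog characters covered by copied substrings are all assumptions made for illustration. Each of the m start positions in the blog triggers one binary search over the suffix array (O(log n) steps, each comparing up to m characters), which is consistent with the O(m^2 log n) bound stated above.

```python
def build_suffix_array(text: str) -> list[int]:
    # Naive construction by sorting suffixes; adequate for a small demo only.
    # A real corpus would need a dedicated O(n log n) or linear-time builder.
    return sorted(range(len(text)), key=lambda i: text[i:])


def common_prefix_len(corpus: str, start: int, query: str) -> int:
    # Number of leading characters of `query` that match corpus[start:].
    k = 0
    while k < len(query) and start + k < len(corpus) and corpus[start + k] == query[k]:
        k += 1
    return k


def longest_match(corpus: str, sa: list[int], query: str) -> int:
    # Length of the longest prefix of `query` occurring anywhere in `corpus`.
    lo, hi = 0, len(sa)
    while lo < hi:  # binary search for the insertion point of `query`
        mid = (lo + hi) // 2
        if corpus[sa[mid]:sa[mid] + len(query)] < query:
            lo = mid + 1
        else:
            hi = mid
    # The best match is always one of the two suffixes adjacent to that point.
    best = 0
    for idx in (lo - 1, lo):
        if 0 <= idx < len(sa):
            best = max(best, common_prefix_len(corpus, sa[idx], query))
    return best


def copy_rate(corpus: str, blog: str, min_len: int = 5) -> float:
    # Assumed definition: fraction of characters in `blog` that lie inside
    # some substring of length >= min_len that also occurs in `corpus`.
    if not blog:
        return 0.0
    sa = build_suffix_array(corpus)
    covered = [False] * len(blog)
    for i in range(len(blog)):
        k = longest_match(corpus, sa, blog[i:])
        if k >= min_len:
            for j in range(i, i + k):
                covered[j] = True
    return sum(covered) / len(blog)
```

In this sketch the reference corpus would be the other blogs concatenated with a separator that never occurs in blog text (for example `"\x00".join(other_blogs)`), so that matches cannot span two documents. A blog whose copy rate exceeds some tuned threshold would then be flagged as a candidate splog; the abstract does not state what threshold or classifier the authors used.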
