Recognizing Predatory Chat Documents using Semi-supervised Anomaly Detection

Document Recognition and Retrieval. Pub Date: 2016-02-17. DOI: 10.2352/ISSN.2470-1173.2016.17.DRR-063

M. Ebrahimi, C. Suen, O. Ormandjieva, A. Krzyżak

Citations: 34

Abstract

Recognizing Predatory Chat Documents using Semi-supervised Anomaly Detection

Chat logs are informative documents available to today's social network providers. Providers and law enforcement tend to use these huge logs anonymously for automatic online Sexual Predator Identification (SPI), a relatively new application area. The task plays an important role in protecting children and juveniles from exploitation by online predators. Pattern recognition techniques help law enforcement automatically identify harmful conversations in cyberspace. These techniques usually require a large volume of high-quality training instances of both predatory and non-predatory documents. However, collecting non-predatory documents is impractical in real-world applications, since this category covers a wide variety of documents on many topics, including politics, sports, science, and technology. We used a new semi-supervised approach to mitigate this problem by adapting an anomaly detection technique called One-class Support Vector Machine, which does not require non-predatory samples for training. We compared the performance of this approach against other state-of-the-art methods that use both positive and negative instances. We observed that although the anomaly detection approach uses only one class label for training (a very desirable property in practice), its performance is comparable to that of binary SVM classification. In addition, this approach outperforms the classic two-class Naïve Bayes algorithm, which we used as our baseline, in terms of both classification accuracy and precision.

Introduction

During the past decade, automated online Sexual Predator Identification from chat documents has boomed, driven by pattern recognition techniques capable of flagging likely predators for the attention of law enforcement.
The most common approach was presented in the PAN-2012 international competition [1], which was specifically designed around the following two tasks [2]:

1. Finding the predators vs. victims
2. Finding the predatory messages in a predatory document

The first task seems more important for law enforcement, since it can drastically limit their search space. It is worth mentioning that the second task has not been as successful as the first, because it requires deeper natural language analysis. The first task can be performed in two steps [3]:

1. Identifying the predatory documents in the entire conversation corpus
2. Searching among the participants of predatory documents to distinguish the sexual predator from the victim

In this paper we focus on the first step (i.e., identifying the predatory conversations), since it is the most suitable area for helping investigators in real-world applications. Accordingly, the main motivation behind using One-class SVM on this kind of data, and treating the problem as an anomaly detection problem, is to build a classifier that can learn from only one class label rather than the two required in traditional binary classification. Figure 1 depicts the different granularity levels for designing classifiers in online sexual predator identification.

Figure 1. Classification granularity levels and their corresponding classification problems in SPI

Section 2 describes the current status of SPI, Section 3 explains the proposed approach, which is based on semi-supervised anomaly detection, and Section 4 dissects the document recognition process we conducted on the SPI problem, including pre-processing, feature extraction, and pattern classification. The results of comparing the different methods are also described in that section.
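
The two-step decomposition of the first task described in this introduction can be sketched in a few lines. This is only an illustration of the control flow, not any PAN-2012 system: the function name `two_step_identification`, the `is_suspicious` and `participant_score` callables, and the toy keyword rules used in the usage lines are all hypothetical stand-ins for trained classifiers.

```python
from typing import Callable, Dict, List

# A conversation maps each participant id to the list of that participant's messages.
Conversation = Dict[str, List[str]]


def two_step_identification(
    corpus: Dict[str, Conversation],
    is_suspicious: Callable[[Conversation], bool],
    participant_score: Callable[[List[str]], float],
) -> Dict[str, str]:
    """Return {conversation id: highest-scoring participant id} for flagged conversations."""
    # Step 1: keep only the conversations (documents) the document-level model flags.
    flagged = {cid: conv for cid, conv in corpus.items() if is_suspicious(conv)}

    # Step 2: inside each flagged conversation, pick the participant with the top score.
    result: Dict[str, str] = {}
    for cid, conv in flagged.items():
        if conv:
            result[cid] = max(conv, key=lambda pid: participant_score(conv[pid]))
    return result


# Toy usage with placeholder rules (hypothetical; real systems use trained models).
corpus = {
    "c1": {"u1": ["hello flag", "flag again"], "u2": ["hi"]},
    "c2": {"u3": ["weather"], "u4": ["sports"]},
}


def count_flag(messages: List[str]) -> float:
    # Number of messages containing the placeholder token "flag".
    return float(sum("flag" in m for m in messages))


def any_flag(conv: Conversation) -> bool:
    # A conversation is "suspicious" if any participant has a flagged message.
    return any(count_flag(msgs) > 0 for msgs in conv.values())


result = two_step_identification(corpus, any_flag, count_flag)
```

Separating the two steps means the document-level filter can be swapped (for example, for a one-class model) without touching the participant-level step.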

Motivation

According to researchers who participated in PAN-2012, the data set has a major weakness: the non-predatory and non-sexual samples were gathered exclusively from publicly available IRC logs, which mainly contain chats about computer and web technologies and therefore cannot represent “general conversations” [4]. The samples in the general conversation category (which are also non-predatory) must cover countless topics such as sports, music, games, and computers. In practice, assembling such a training data set is not easy. As a result, the current top-ranked algorithms in PAN-2012 may have learned to distinguish computer-related chats from sex-related chats rather than to identify actual predatory chats online. Accordingly, one can expect their performance to decrease in real-world applications. In other words, although the top-ranked algorithms in PAN-2012 achieved a high F1-score on the test data set (87% for the winner), we believe that, because they require general samples that properly represent the non-predatory data, their performance will decrease significantly in practical environments such as law enforcement. In this work, we propose a novel way to handle this problem by eliminating the need for both class labels in the training data set. Because one of the class labels is absent from the training process, our method is more practical, at the expense of a lower but still acceptable F1-score. Using only one class label in training makes this a semi-supervised classification method. Furthermore, to confirm the effectiveness of our approach, we aim to beat the baseline (the naïve Bayes algorithm) in terms of F1-score.

©2016 Society for Imaging Science and Technology. IS&T International Symposium on Electronic Imaging 2016, Document Recognition and Retrieval XXIII, DRR-063.1
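
To make the one-class training idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes scikit-learn, TF-IDF features, a linear kernel, and `nu=0.1`, none of which are stated in the text above; the documents are neutral placeholder strings rather than real chat data. The model is fitted on documents of a single class only, and F1 is computed afterwards on a small mixed test set.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import f1_score
from sklearn.svm import OneClassSVM

# Training set: documents from the single labeled (target) class only.
# Placeholder tokens stand in for real conversations.
target_train = [
    "alpha beta gamma alpha delta",
    "beta gamma alpha epsilon",
    "gamma alpha beta beta delta",
    "alpha delta gamma beta",
]

# Mixed test set: 1 = target class, 0 = any other conversation.
test_docs = [
    "alpha gamma beta delta",
    "football match score tonight",
    "compile kernel driver patch",
]
test_labels = [1, 0, 0]

# Fit the vocabulary on the target class alone; no negative samples are used.
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(target_train)
X_test = vectorizer.transform(test_docs)

# nu upper-bounds the fraction of training points treated as outliers.
# kernel="linear" and nu=0.1 are assumed values, not taken from the paper.
model = OneClassSVM(kernel="linear", nu=0.1)
model.fit(X_train)

# predict returns +1 for inliers (target class) and -1 for outliers.
raw = model.predict(X_test)
predicted = [1 if r == 1 else 0 for r in raw]

# F1 on the target class, the metric used to compare against the baseline.
score = f1_score(test_labels, predicted, pos_label=1, zero_division=0)
print(predicted, round(score, 3))
```

In a realistic setting the kernel and `nu` would be tuned on held-out data, and a binary SVM or Naïve Bayes model trained on both classes would be evaluated on the same test set for comparison.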
Note that each chat conversation represents a document in our recognition process; hence, in the remainder of this paper we use “document” and “conversation” interchangeably.

Related Work

Perhaps the first successful attempt to use machine learning for the SPI problem was made by Pendar, who used a weighted K-NN classifier to distinguish predators from underage victims [5]. To the best of our knowledge, the first empirical system capable of identifying predatory messages in chat logs is ChatCoder1 (and ChatCoder2), implemented and evolved by Kontostathis and her colleagues [6] [7]. The system uses a rule-based approach in conjunction with decision trees and instance-based learning methods (K-NN). It is worth mentioning that, to deal with learning from imbalanced data, [8] had already introduced a general approach, not specific to SPI, that uses a weighted version of the K-NN algorithm to mitigate the problem of imbalanced data in text categorization. Recently, the PAN-2012 conference has boosted the application of machine learning techniques to this area. The main strength of this conference is that it provided the first publicly available official data set specifically engineered for the sexual predator identification task. Researchers tuned their proposed methods on the same training data and reported their performance on the test data. Several machine learning algorithms were used to solve the SPI problem in this competition, covering a wide range of classification methods such as maximum entropy-based classification [9], K-NN [10], Support Vector Machines [4], and Neural Networks [3]. Eventually, one team was announced as the winner based on its classification accuracy and an augmented F-measure. The winning team [3] used a two-step binary classification approach, consisting of SCI (Suspicious Conversation Identification) and VFP (Victim From Predator Disclosure), using SVM and Neural Networks.
Accordingly, we used SVM as the state-of-the-art method against which to compare the performance of our anomaly detection approach. Escalante and his colleagues [11] proposed a new method based on learning a chain of three local classifiers corresponding to three segments of each document (i.e., conversation), but the approach could not outperform that of the PAN-2012 winner. Related research on cyberbullying, which is very close to predator identification, has been done by Kontostathis [12]. They use a different supervised learning algorithm based on Latent Semantic Indexing, called Essential Dimensions of LSI, to identify cyberbullying. They built their own data set using Formspring.me, a popular question-and-answer website. In the most recent work, [13] proposed enriching the traditional bag-of-words language model by adding other feature types, including sentiment features, psycho-linguistic features, and discourse patterns. Eventually, they used binary classification for the actual predator identification task. Generally, the algorithms used in PAN-2012 can be considered the state of the art in sexual predator identification.

With regard to anomaly detection, there is a wide variety of unsupervised, supervised, and semi-supervised models. A comprehensive survey of anomaly detection is given in [14]. The authors categorize anomaly detection methods into six major categories: clustering-based, classification-based, nearest-neighbor-based (which also includes density-based methods), statistical, information-theoretic, and spectral methods. We use a slightly different taxonomy, based on the learning method used for anomaly detection, to show where our method fits. We avoid describing the different methods and foundations of anomaly detection, since that is beyond the scope of this article. Instead, we focus on the specific anomaly detection method (i.e., one-class SVM) that yielded the desired results in this application domain. Figure 2 illustrates the taxonomy of the most common anomaly detection techniques as well as the position of semi-supervised techniques.

Figure 2. Position of semi-supervised and SVM-based techniques in the taxonomy of anomaly detection techniques

One-class SVM is highlighted in the figure. For the sake of completeness, the unsupervised SVM-based algorithms are shown as well. The corresponding leaf nodes of the taxonomy will
