Objectionable content filtering by click-through data

Proceedings of the 22nd ACM international conference on Information & Knowledge Management. Pub Date: 2013-10-27. DOI: 10.1145/2505515.2507849

Lung-Hao Lee, Yen-Cheng Juan, Hsin-Hsi Chen, Yuen-Hsien Tseng

Citations: 3

Abstract

This paper explores users' browsing intents to predict the category of a user's next access during web surfing, and applies the predictions to objectionable content filtering. A user's access trail, represented as a sequence of URLs, reveals the context of web browsing behavior. We extract behavioral features from each clicked URL (hostname, bag-of-words, gTLD, IP, and port) and use them to build a linear-chain CRF model for context-aware category prediction. Large-scale experiments show that the method achieves an accuracy of 0.9396 in identifying objectionable accesses without requesting the corresponding page content. Error analysis indicates a low false positive rate of 0.0571. In real-life filtering simulations, the model achieves a macro-averaged blocking rate of 0.9271 while keeping the macro-averaged over-blocking rate low at 0.0575, when collaboratively filtering objectionable content as the dynamic web changes over time.
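
The abstract names five per-URL features but does not say how they are computed. The following is a minimal sketch of one plausible reading, not the authors' implementation. The function names `url_features` and `trail_features`, the short `GTLDS` set, the tokenization rule for the bag-of-words, and the interpretation of "IP" as "the host is a literal IP address" are all assumptions made for illustration.

```python
import ipaddress
import re
from urllib.parse import urlsplit

# Hypothetical, deliberately short gTLD list (assumption, not from the paper).
GTLDS = {"com", "net", "org", "info", "biz", "edu", "gov"}


def url_features(url: str) -> dict:
    """Derive hostname, bag-of-words, gTLD, IP, and port features from one URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()

    # Assumed reading of the "IP" feature: is the host a literal IP address?
    try:
        ipaddress.ip_address(host)
        is_ip = True
    except ValueError:
        is_ip = False

    # The last dot-separated label counts as a gTLD only if it is in our list.
    labels = host.split(".") if host and not is_ip else []
    gtld = labels[-1] if labels and labels[-1] in GTLDS else None

    # Assumed bag-of-words: unique lowercase alphabetic tokens in path and query.
    text = (parts.path + " " + parts.query).lower()
    words = sorted(set(re.findall(r"[a-z]+", text)))

    return {
        "hostname": host,
        "words": words,
        "gtld": gtld,
        "is_ip": is_ip,
        "port": parts.port,  # None when the URL carries no explicit port
    }


def trail_features(trail: list[str]) -> list[dict]:
    """Map an access trail (a sequence of URLs) to a sequence of feature dicts."""
    return [url_features(u) for u in trail]
```

A sequence produced by `trail_features` has one feature dictionary per clicked URL, which is the general shape of input that linear-chain CRF toolkits expect. Because every feature comes from the URL string alone, nothing here requires fetching page content.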

Source journal

Proceedings of the 22nd ACM international conference on Information & Knowledge Management

Self-citation rate

0.00%

Number of publications