A Clique Based Web Page Classification Corrective Approach

Belmouhcine Abdelbadie, Benkhalifa Mohammed
DOI: 10.1109/WI-IAT.2014.135
Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 02
Published: 2014-08-11
Nowadays, the web is the most relevant data source, and its size keeps growing day by day. Web page classification becomes crucial due to this overwhelming amount of data. Web pages contain much noisy content that biases textual classifiers and causes them to lose focus on a page's main subject. Web pages are related to each other either implicitly, through users' intuitive judgments, or explicitly, through hyperlinks. Thus, using those links to correct the class a textual classifier assigns to a web page can be beneficial. In this paper, we propose a post-classification corrective approach called Clique Based Correction (CBC) that uses the query-log to build an implicit neighborhood and collectively corrects the classes a textual classifier assigns to the web pages of that neighborhood. This correction improves the text classifier's results by fixing wrongly assigned categories. When two web pages are linked to each other, they may share the same topic; when more web pages (three, for example) are all related to each other, the probability that they share the same subject becomes higher. The proposed method operates in four steps. First, it builds a graph called the implicit graph, whose vertices are web pages and whose edges are implicit links. Second, it uses a text classifier to determine the classes of all web pages represented by vertices in the implicit graph. Third, it extracts cliques of web pages from the implicit graph. Fourth, it assigns a class to every clique using a voting process. Each web page is then labeled with the class of its clique. This adjustment improves the results provided by the text classifier. We conduct our experiments with three classifiers, SVM (Support Vector Machine), NB (Naïve Bayes), and KNN (K Nearest Neighbors), on two subsets of ODP (Open Directory Project). Results show that: (1) when applied after SVM, NB, or KNN, CBC improves their results; (2) the number of unrelated web pages must be low for the improvement to be significant.
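The four-step procedure the abstract describes can be sketched in Python. This is a minimal illustration under assumptions, not the paper's implementation: the page IDs, implicit links, and base-classifier labels below are invented toy data, and maximal cliques are enumerated with a plain Bron-Kerbosch recursion standing in for whatever clique-extraction method the authors used.

```python
from collections import Counter

# Step 0 (toy data, hypothetical): pages, implicit links (e.g. pages that
# co-occur in query-log sessions), and labels from a base text classifier.
pages = ["p1", "p2", "p3", "p4", "p5"]
implicit_links = {("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p4", "p5")}
text_labels = {"p1": "sports", "p2": "sports", "p3": "news",
               "p4": "tech", "p5": "tech"}

# Step 1: build the implicit graph as an adjacency map.
adj = {p: set() for p in pages}
for u, v in implicit_links:
    adj[u].add(v)
    adj[v].add(u)

def bron_kerbosch(r, p, x, cliques):
    """Step 3: enumerate maximal cliques of the implicit graph."""
    if not p and not x:
        cliques.append(r)
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], cliques)
        p.remove(v)
        x.add(v)

cliques = []
bron_kerbosch(set(), set(pages), set(), cliques)

# Step 4: majority vote inside each clique corrects the base labels
# (step 2, the text classifier itself, is represented by text_labels).
corrected = dict(text_labels)
for clique in cliques:
    if len(clique) < 2:
        continue
    vote, _ = Counter(text_labels[p] for p in clique).most_common(1)[0]
    for page in clique:
        corrected[page] = vote
```

On the toy graph, the triangle {p1, p2, p3} votes two "sports" against one "news", so p3's label is corrected to "sports", which is the kind of adjustment CBC performs.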