A Clique Based Web Page Classification Corrective Approach

Belmouhcine Abdelbadie, Benkhalifa Mohammed
DOI: 10.1109/WI-IAT.2014.135
Proceedings of the 2014 IEEE/WIC/ACM International Joint Conferences on Web Intelligence (WI) and Intelligent Agent Technologies (IAT) - Volume 02
Published: 2014-08-11
Nowadays, the web is the most relevant data source, and its size keeps growing day by day. Web page classification becomes crucial due to this overwhelming amount of data. Web pages contain much noisy content that biases textual classifiers and causes them to lose focus on a page's main subject. Web pages are related to each other either implicitly, through users' intuitive judgments, or explicitly, through hyperlinks. Thus, using those links to correct the class a textual classifier assigns to a web page can be beneficial. In this paper, we propose a post-classification corrective approach called Clique Based Correction (CBC) that uses the query-log to build an implicit neighborhood and collectively corrects the classes a textual classifier assigns to the web pages of that neighborhood. This correction improves the text classifier's results by fixing wrongly assigned categories. When two web pages are linked to each other, they may share the same topic; when more web pages (three, for example) are all related to each other, the probability that they share the same subject becomes higher. The proposed method operates in four steps. First, it builds a graph called the implicit graph, whose vertices are web pages and whose edges are implicit links. Second, it uses a text classifier to determine the classes of all web pages represented by vertices in the implicit graph. Third, it extracts cliques of web pages from the implicit graph. Fourth, it assigns a class to every clique using a voting process. Each web page is then labeled with the class of its clique. This adjustment improves the results provided by the text classifier. We conduct our experiments with three classifiers, SVM (Support Vector Machine), NB (Naïve Bayes), and KNN (K Nearest Neighbors), on two subsets of ODP (Open Directory Project). Results show that: (1) when applied after SVM, NB, or KNN, CBC improves their results; (2) the number of unrelated web pages must be low for the improvement to be significant.
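The four-step procedure the abstract describes can be sketched in Python. This is a minimal illustration under assumptions, not the paper's implementation: the page IDs, implicit links, and base-classifier labels below are invented toy data, and maximal cliques are enumerated with a plain Bron-Kerbosch recursion standing in for whatever clique-extraction method the authors used.

```python
from collections import Counter

# Step 0 (toy data, hypothetical): pages, implicit links (e.g. pages that
# co-occur in query-log sessions), and labels from a base text classifier.
pages = ["p1", "p2", "p3", "p4", "p5"]
implicit_links = {("p1", "p2"), ("p1", "p3"), ("p2", "p3"), ("p4", "p5")}
text_labels = {"p1": "sports", "p2": "sports", "p3": "news",
               "p4": "tech", "p5": "tech"}

# Step 1: build the implicit graph as an adjacency map.
adj = {p: set() for p in pages}
for u, v in implicit_links:
    adj[u].add(v)
    adj[v].add(u)

def bron_kerbosch(r, p, x, cliques):
    """Step 3: enumerate maximal cliques of the implicit graph."""
    if not p and not x:
        cliques.append(r)
        return
    for v in list(p):
        bron_kerbosch(r | {v}, p & adj[v], x & adj[v], cliques)
        p.remove(v)
        x.add(v)

cliques = []
bron_kerbosch(set(), set(pages), set(), cliques)

# Step 4: majority vote inside each clique corrects the base labels
# (step 2, the text classifier itself, is represented by text_labels).
corrected = dict(text_labels)
for clique in cliques:
    if len(clique) < 2:
        continue
    vote, _ = Counter(text_labels[p] for p in clique).most_common(1)[0]
    for page in clique:
        corrected[page] = vote
```

On the toy graph, the triangle {p1, p2, p3} votes two "sports" against one "news", so p3's label is corrected to "sports", which is the kind of adjustment CBC performs.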