LWCS:一个基于锚图哈希的大规模网页分类系统

2015 6th IEEE International Conference on Software Engineering and Service Science (ICSESS) Pub Date : 2015-11-30 DOI:10.1109/ICSESS.2015.7339012

Yi Zheng, Chengcheng Sun, Chengzhang Zhu, Xv Lan, Xiang Fu, Weihong Han

{"title":"LWCS:一个基于锚图哈希的大规模网页分类系统","authors":"Yi Zheng, Chengcheng Sun, Chengzhang Zhu, Xv Lan, Xiang Fu, Weihong Han","doi":"10.1109/ICSESS.2015.7339012","DOIUrl":null,"url":null,"abstract":"Nowadays, while we are enjoying the convenience brought by such a huge repository of online web information, we may come across difficulties in finding the web pages we want related to particular information we are searching for. Hence, it is essential to classify web documents to facilitate the search and retrieval of pages. Existing algorithms work well with a small quantity of web pages, whereas, they become slow and even non-effective while dealing with a large scale of web pages. Recently, some of these algorithms were adapted to distributed platforms which boosted their classification speeds effectively. However, due to high dimensions of web page features, the parallel classifiers were still trained with limited capacity training sets. In addition, these methods didn't improve the classification itself, merely boosted by high computing performance of distributed platforms. So oriented to large-scale web page classification, we propose to integrate anchor graph hashing with K-Nearest Neighbour(KNN) classifier to reduce the pages' original feature dimensions. The hash value of each page is used for training and classification instead of the original vectors. Experimental comparison with the original KNN on a large dataset demonstrates the efficacy of our proposed method.","PeriodicalId":335871,"journal":{"name":"2015 6th IEEE International Conference on Software Engineering and Service Science (ICSESS)","volume":"49 14 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-11-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"9","resultStr":"{\"title\":\"LWCS: A large-scale web page classification system based on anchor graph hashing\",\"authors\":\"Yi Zheng, Chengcheng Sun, Chengzhang Zhu, Xv Lan, Xiang Fu, Weihong Han\",\"doi\":\"10.1109/ICSESS.2015.7339012\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Nowadays, while we are enjoying the convenience brought by such a huge repository of online web information, we may come across difficulties in finding the web pages we want related to particular information we are searching for. Hence, it is essential to classify web documents to facilitate the search and retrieval of pages. Existing algorithms work well with a small quantity of web pages, whereas, they become slow and even non-effective while dealing with a large scale of web pages. Recently, some of these algorithms were adapted to distributed platforms which boosted their classification speeds effectively. However, due to high dimensions of web page features, the parallel classifiers were still trained with limited capacity training sets. In addition, these methods didn't improve the classification itself, merely boosted by high computing performance of distributed platforms. So oriented to large-scale web page classification, we propose to integrate anchor graph hashing with K-Nearest Neighbour(KNN) classifier to reduce the pages' original feature dimensions. The hash value of each page is used for training and classification instead of the original vectors. Experimental comparison with the original KNN on a large dataset demonstrates the efficacy of our proposed method.\",\"PeriodicalId\":335871,\"journal\":{\"name\":\"2015 6th IEEE International Conference on Software Engineering and Service Science (ICSESS)\",\"volume\":\"49 14 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2015-11-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"9\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2015 6th IEEE International Conference on Software Engineering and Service Science (ICSESS)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICSESS.2015.7339012\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 6th IEEE International Conference on Software Engineering and Service Science (ICSESS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICSESS.2015.7339012","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 9

摘要

如今，当我们享受如此庞大的在线网络信息库所带来的便利时，我们可能会遇到找不到我们想要的与我们正在搜索的特定信息相关的网页的困难。因此，对web文档进行分类以促进页面的搜索和检索是必要的。现有的算法在处理少量网页时表现良好，而在处理大规模网页时则变得缓慢甚至无效。近年来，一些算法被应用于分布式平台，有效地提高了分类速度。然而，由于网页特征的高维，并行分类器的训练仍然是有限容量的训练集。此外，这些方法并没有提高分类本身，只是利用分布式平台的高计算性能来提高分类的效率。因此，针对大规模网页分类，我们提出将锚图哈希与k近邻(KNN)分类器相结合，降低网页的原始特征维数。每个页面的哈希值被用于训练和分类，而不是原始向量。与原始KNN在大型数据集上的实验比较表明了本文方法的有效性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

LWCS: A large-scale web page classification system based on anchor graph hashing

Nowadays, while we are enjoying the convenience brought by such a huge repository of online web information, we may come across difficulties in finding the web pages we want related to particular information we are searching for. Hence, it is essential to classify web documents to facilitate the search and retrieval of pages. Existing algorithms work well with a small quantity of web pages, whereas, they become slow and even non-effective while dealing with a large scale of web pages. Recently, some of these algorithms were adapted to distributed platforms which boosted their classification speeds effectively. However, due to high dimensions of web page features, the parallel classifiers were still trained with limited capacity training sets. In addition, these methods didn't improve the classification itself, merely boosted by high computing performance of distributed platforms. So oriented to large-scale web page classification, we propose to integrate anchor graph hashing with K-Nearest Neighbour(KNN) classifier to reduce the pages' original feature dimensions. The hash value of each page is used for training and classification instead of the original vectors. Experimental comparison with the original KNN on a large dataset demonstrates the efficacy of our proposed method.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2015 6th IEEE International Conference on Software Engineering and Service Science (ICSESS)

自引率

0.00%

发文量