LWCS: A large-scale web page classification system based on anchor graph hashing

Yi Zheng, Chengcheng Sun, Chengzhang Zhu, Xv Lan, Xiang Fu, Weihong Han
Published in: 2015 6th IEEE International Conference on Software Engineering and Service Science (ICSESS)
Publication date: 2015-11-30
DOI: 10.1109/ICSESS.2015.7339012 (https://doi.org/10.1109/ICSESS.2015.7339012)
Citations: 9

Abstract

Nowadays, while we enjoy the convenience of a huge repository of online web information, we may still have difficulty finding the web pages related to the particular information we are searching for. Hence, it is essential to classify web documents to facilitate the search and retrieval of pages. Existing algorithms work well with a small number of web pages, but they become slow and even ineffective when dealing with web pages at large scale. Recently, some of these algorithms were adapted to distributed platforms, which boosted their classification speed effectively. However, because web page features are high-dimensional, the parallel classifiers were still trained on training sets of limited capacity. In addition, these methods did not improve the classification itself; they were merely accelerated by the high computing performance of distributed platforms. Therefore, for large-scale web page classification, we propose to integrate anchor graph hashing with a K-Nearest Neighbour (KNN) classifier to reduce the dimensionality of the pages' original features. The hash value of each page, instead of the original feature vector, is used for training and classification. Experimental comparison with the original KNN on a large dataset demonstrates the efficacy of our proposed method.
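The classification step described in the abstract, KNN over per-page hash values rather than the original feature vectors, can be sketched as follows. This is a minimal illustration, not the authors' implementation. It assumes each page has already been reduced to a short binary code (for example by anchor graph hashing, which is not reproduced here) stored as a Python integer, and that neighbours are compared by Hamming distance. The function names, the 4-bit codes, and the labels are hypothetical.

```python
from collections import Counter
from typing import Hashable, Sequence


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hash codes stored as ints."""
    return bin(a ^ b).count("1")


def knn_classify(
    train_codes: Sequence[int],
    train_labels: Sequence[Hashable],
    query_code: int,
    k: int = 3,
) -> Hashable:
    """Majority vote among the k training pages closest in Hamming distance."""
    # sorted() is stable, so pages at equal distance keep their training order.
    ranked = sorted(
        range(len(train_codes)),
        key=lambda i: hamming(train_codes[i], query_code),
    )
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical 4-bit hash codes and category labels, for illustration only.
TRAIN_CODES = [0b0000, 0b0001, 0b1111, 0b1110]
TRAIN_LABELS = ["news", "news", "sports", "sports"]

if __name__ == "__main__":
    print(knn_classify(TRAIN_CODES, TRAIN_LABELS, 0b0010, k=3))
```

Comparing short binary codes costs one XOR and a bit count per pair, which is the kind of saving over distance computations on high-dimensional feature vectors that the abstract's dimensionality reduction targets.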