Clustering for Web information hierarchy mining

Proceedings IEEE/WIC International Conference on Web Intelligence (WI 2003). Pub Date: 2003-10-13. DOI: 10.1109/WI.2003.1241299

Hung-Yu Kao, Jan-Ming Ho, Ming-Syan Chen


Citations: 0

Abstract


With the growth of dynamic page generation techniques, the number and complexity of Web pages have increased explosively. Pages generated dynamically from the same template have similar structures and are usually assembled from a set of fundamental information clusters. Neighboring information clusters usually carry similar semantics and together form a larger cluster with more generalized information. The hierarchical structure that information clusters form in this bottom-up manner is called the information hierarchy of a page. We study the problem of mining the information hierarchies of pages in Web sites in order to recognize how information is distributed within a page at multiple levels and granularities. Specifically, we propose an information clustering system that applies a top-down information centroid searching algorithm and a multigranularity centroid converging process to the document object model (DOM) trees of pages to build their information hierarchies. Experiments on several real news Web sites show that the proposed method achieves high precision and recall in determining the information clusters of pages, and validate its practical applicability to real Web sites.
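The abstract does not spell out the algorithm, so the following Python sketch is only an illustration of what a top-down centroid search over a DOM-like tree might look like. It is not the authors' method. The `Node` class, the `info` measure (total text length in a subtree), the `find_centroid` function, and the `threshold` parameter are all assumptions made for this example.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A simplified DOM node (hypothetical): a tag label, own text, children."""
    tag: str
    text: str = ""
    children: List["Node"] = field(default_factory=list)


def info(node: Node) -> int:
    """Assumed information measure: total text length in the subtree."""
    return len(node.text) + sum(info(child) for child in node.children)


def find_centroid(node: Node, threshold: float = 0.6) -> Node:
    """Descend from `node` while one child holds at least `threshold`
    of the current subtree's information; return where the descent stops."""
    total = info(node)
    if total == 0:
        return node
    for child in node.children:
        if info(child) / total >= threshold:
            return find_centroid(child, threshold)
    return node


# A toy news page: a short navigation bar and a content block.
page = Node("body", children=[
    Node("div#nav", text="Home"),
    Node("div#content", children=[
        Node("h1", text="Headline"),
        Node("p", text="a" * 40),
        Node("p", text="b" * 40),
    ]),
])

centroid = find_centroid(page)
print(centroid.tag)  # prints "div#content"
```

Lowering `threshold` makes the search descend further and return a finer-grained node, while raising it stops the search higher in the tree. Running the search at several thresholds is one simple way to view a page at multiple granularities, though the paper's multigranularity centroid converging process may work quite differently.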
