Two-phase Web site classification based on hidden Markov tree models

Yonghong Tian, Tiejun Huang, Wen Gao
Proceedings IEEE/WIC International Conference on Web Intelligence (WI 2003), published 2003-10-13
DOI: 10.1109/WI.2003.1241198 (https://doi.org/10.1109/WI.2003.1241198)
Citations: 33

Abstract

As the Web grows exponentially in both the volume and the diversity of its information, automatic classification of topic-specific Web sites is highly desirable. We propose a novel approach to Web site classification that uses the content, structure, and context information of Web sites. The site structure is represented as a two-layered tree: each page is modeled as a DOM (Document Object Model) tree, and a site tree hierarchically links all pages within the site. Two context models are presented to capture the topic dependencies within the site. The hidden Markov tree (HMT) model is then used as the statistical model of both the site tree and the DOM tree, and an HMT-based classifier is presented for classifying them. Moreover, to reduce the amount of a Web site that must be downloaded while maintaining high classification accuracy, an entropy-based approach is introduced to dynamically prune the site trees. On this basis, we employ a two-phase classification system that classifies Web sites through a fine-to-coarse recursion. Experiments show that our approach offers high accuracy and efficient processing.
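The idea of an HMT-based classifier can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the features, the state space, the context models, or the training procedure. The sketch assumes each tree node carries one discrete observation symbol, one HMT is available per class, and a tree is assigned to the class whose model gives it the highest likelihood, computed with the standard upward (leaves-to-root) recursion for hidden Markov trees. All names (`Node`, `HMT`, `classify`), the class labels, and the parameter values are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass
class Node:
    """A tree node (e.g. a page in a site tree) with one discrete observation."""
    obs: int
    children: List["Node"] = field(default_factory=list)


@dataclass
class HMT:
    """Hidden Markov tree parameters (hypothetical, hand-set values below)."""
    prior: Sequence[float]            # prior[s]    = P(root state = s)
    trans: Sequence[Sequence[float]]  # trans[p][c] = P(child state = c | parent state = p)
    emit: Sequence[Sequence[float]]   # emit[s][o]  = P(observation = o | state = s)


def upward(model: HMT, node: Node) -> List[float]:
    """Return beta[s] = P(observations in node's subtree | node state = s)."""
    n = len(model.prior)
    beta = [model.emit[s][node.obs] for s in range(n)]
    for child in node.children:
        child_beta = upward(model, child)
        for s in range(n):
            # Marginalize over the child's hidden state given parent state s.
            beta[s] *= sum(model.trans[s][c] * child_beta[c] for c in range(n))
    return beta


def likelihood(model: HMT, root: Node) -> float:
    """Likelihood of the whole observed tree under one HMT."""
    beta = upward(model, root)
    return sum(p * b for p, b in zip(model.prior, beta))


def classify(models: Dict[str, HMT], root: Node) -> str:
    """Pick the class label whose HMT assigns the tree the highest likelihood."""
    return max(models, key=lambda label: likelihood(models[label], root))


# Two toy class models with 2 hidden states and 2 observation symbols.
sports = HMT(prior=[0.5, 0.5],
             trans=[[0.9, 0.1], [0.1, 0.9]],
             emit=[[0.9, 0.1], [0.6, 0.4]])   # favors symbol 0
news = HMT(prior=[0.5, 0.5],
           trans=[[0.9, 0.1], [0.1, 0.9]],
           emit=[[0.1, 0.9], [0.4, 0.6]])     # favors symbol 1
models = {"sports": sports, "news": news}

# A small tree whose nodes all emit symbol 0.
site = Node(0, [Node(0), Node(0, [Node(0)])])
predicted = classify(models, site)
```

For trees of realistic size the products of probabilities underflow, so a practical version would work in log space or rescale `beta` at each node; the plain form is kept here for readability.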