Mining the information architecture of the WWW using automated website boundary detection

Web Intell. Pub Date: 2017-11-20. DOI: 10.3233/WEB-170365
Ayesh Alshukri, Frans Coenen

Abstract

The World Wide Web has two main forms of architecture: that which is explicitly encoded into web pages, and that which is implied by the web content, particularly its look and feel. The latter is exemplified by the concept of a website, a concept that is only loosely defined, although users intuitively understand it. The Website Boundary Detection (WBD) problem is concerned with identifying the complete collection of web pages and resources contained within a single website. However it is defined, the concept of a website is used in a number of application domains, including website archiving, spam detection, and WWW analysis. In such applications it is beneficial if a website can be identified automatically. This is usually done by identifying a website of interest in terms of its boundary, the so-called WBD problem. In this paper seven WBD techniques are proposed and compared: four statistical techniques, where the web data to be used is obtained a priori, and three dynamic techniques, where the data is obtained as the process progresses. All seven techniques are presented in detail and evaluated.
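The abstract does not describe the seven techniques themselves, so the following is only a minimal sketch of the distinction it draws between working on data obtained a priori and obtaining data as the process progresses. Everything in it is an assumption made for illustration: the toy link graph `TOY_WEB`, the look-and-feel feature sets, the Jaccard-based `similar` test and its threshold, and the function names `static_wbd` and `dynamic_wbd`. None of these come from the paper.

```python
from collections import deque

# Hypothetical toy web: URL -> (look-and-feel features, outgoing links).
# Invented for illustration; not data from the paper.
TOY_WEB = {
    "a/home": ({"blue", "logo"}, ["a/about", "b/home"]),
    "a/about": ({"blue", "logo"}, ["a/home", "a/news"]),
    "a/news": ({"blue", "logo", "feed"}, ["a/home"]),
    "b/home": ({"red", "banner"}, ["b/shop"]),
    "b/shop": ({"red", "banner"}, ["b/home"]),
}


def similar(x, y, threshold=0.5):
    """Assumed membership test: Jaccard similarity of two feature sets."""
    return len(x & y) / len(x | y) >= threshold


def static_wbd(web, seed):
    """A priori setting: every page is already collected, so classify them all."""
    seed_features = web[seed][0]
    return {url for url, (features, _) in web.items()
            if similar(features, seed_features)}


def dynamic_wbd(web, seed):
    """Incremental setting: fetch pages while crawling from the seed and
    follow links only from pages judged to be inside the website."""
    seed_features = web[seed][0]
    inside, fetched, queue = set(), set(), deque([seed])
    while queue:
        url = queue.popleft()
        if url in fetched:
            continue
        fetched.add(url)
        features, links = web[url]  # stands in for downloading the page
        if similar(features, seed_features):
            inside.add(url)
            queue.extend(links)
    return inside, fetched
```

On this toy graph both functions return the same three pages for the seed `a/home`, but the incremental version never fetches `b/shop`, because it stops expanding at the first page it judges to be outside the boundary. That saving in collected data is the practical difference the sketch is meant to show.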