Automatic Recovery of Broken Links Using Information Retrieval Techniques

Shoaib Hayat, Yue Li, Muhammad Riaz
DOI: 10.1145/3278293.3278296
Published in: Proceedings of the 2nd International Conference on Natural Language Processing and Information Retrieval (2018-09-07)
Citation count: 2

Abstract

The World Wide Web is highly dynamic: every day, web pages are updated, deleted, created, or moved from one domain to another. Because of this, users frequently encounter broken links, a problem the Internet continues to suffer from despite its modern services. A broken link arises for several reasons: the target page may have been permanently deleted or moved to another location, a modification to the target page may have invalidated the link, or the link itself may contain errors in the source page's code. Researchers have proposed several techniques to recover broken links, or at least to retrieve relevant replacement pages. Several sources of evidence have been used for broken-link recovery, including the URL of the target page, the anchor text, the text surrounding the anchor text, and the full text of the source page. A query extracted from these sources is submitted to a retrieval system, which returns a ranked list of highly relevant candidate pages. Previous work relies on TF (term frequency) or DF (document frequency) weights to extract query terms from the anchor text and the full text of the page containing the missing link, but this has not performed well: it tends to retrieve the same set of similar pages for multiple different broken links. In this paper we investigate the use of term-proximity (positional) relationships between anchor-text terms and full-text terms to extract relevant (good) and irrelevant (bad) terms through a classification model. This yields distinct query terms for different broken links and also improves effectiveness, since terms that occur in close proximity to one another reveal more relevance.
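The proximity intuition behind the paper can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' actual classification model: candidate query terms from a source page's full text are scored by their positional distance to the anchor-text terms of the broken link, with nearer occurrences weighted higher. The function and parameter names are assumptions introduced for this example.

```python
def proximity_scores(full_text_tokens, anchor_terms, window=10):
    """Score each non-anchor term by inverse distance to the nearest
    anchor-text term, counting only occurrences within `window` tokens."""
    # Positions where an anchor-text term occurs in the page text.
    anchor_positions = [i for i, tok in enumerate(full_text_tokens)
                        if tok in anchor_terms]
    scores = {}
    for i, tok in enumerate(full_text_tokens):
        if tok in anchor_terms:
            continue
        # Distance to the closest anchor-term occurrence.
        best = min((abs(i - p) for p in anchor_positions), default=None)
        if best is not None and best <= window:
            # Inverse-distance weighting: closer terms reveal more relevance.
            scores[tok] = scores.get(tok, 0.0) + 1.0 / best
    return scores

tokens = ("the broken link recovery system extracts query terms "
          "near the anchor text of each missing page").split()
print(proximity_scores(tokens, {"anchor", "link"}, window=3))
```

Terms occurring within the window of "anchor" or "link" (e.g. "broken", "text") receive positive scores, while distant terms (e.g. "missing") are dropped, so different broken links on the same page naturally yield different query terms.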