Automatic Recovery of Broken Links Using Information Retrieval Techniques

Shoaib Hayat, Yue Li, Muhammad Riaz
DOI: 10.1145/3278293.3278296
Published in: Proceedings of the 2nd International Conference on Natural Language Processing and Information Retrieval (2018-09-07)
Citation count: 2

Abstract

The World Wide Web is highly dynamic: every day, web pages are updated, deleted, created, or moved from one domain to another. Because of this, users frequently encounter broken links, a problem the Internet continues to suffer from despite its modern services. A broken link arises for several reasons: the target page may have been permanently deleted or moved to another location, a modification to the target page may have invalidated the link, or the link itself may contain errors in the source page's code. Researchers have proposed several techniques to recover broken links, or at least to retrieve relevant replacement pages. Several sources of evidence have been used for broken-link recovery, including the URL of the target page, the anchor text, the text surrounding the anchor text, and the full text of the source page. A query extracted from these sources is submitted to a retrieval system, which returns a ranked list of highly relevant candidate pages. Previous work relies on TF (term frequency) or DF (document frequency) weights to extract query terms from the anchor text and the full text of the page containing the missing link, but this has not performed well: it tends to retrieve the same set of similar pages for multiple different broken links. In this paper we investigate the use of term-proximity (positional) relationships between anchor-text terms and full-text terms to extract relevant (good) and irrelevant (bad) terms through a classification model. This yields distinct query terms for different broken links and also improves effectiveness, since terms that occur in close proximity to one another reveal more relevance.
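The proximity intuition behind the paper can be sketched as follows. This is a minimal, hypothetical illustration, not the authors' actual classification model: candidate query terms from a source page's full text are scored by their positional distance to the anchor-text terms of the broken link, with nearer occurrences weighted higher. The function and parameter names are assumptions introduced for this example.

```python
def proximity_scores(full_text_tokens, anchor_terms, window=10):
    """Score each non-anchor term by inverse distance to the nearest
    anchor-text term, counting only occurrences within `window` tokens."""
    # Positions where an anchor-text term occurs in the page text.
    anchor_positions = [i for i, tok in enumerate(full_text_tokens)
                        if tok in anchor_terms]
    scores = {}
    for i, tok in enumerate(full_text_tokens):
        if tok in anchor_terms:
            continue
        # Distance to the closest anchor-term occurrence.
        best = min((abs(i - p) for p in anchor_positions), default=None)
        if best is not None and best <= window:
            # Inverse-distance weighting: closer terms reveal more relevance.
            scores[tok] = scores.get(tok, 0.0) + 1.0 / best
    return scores

tokens = ("the broken link recovery system extracts query terms "
          "near the anchor text of each missing page").split()
print(proximity_scores(tokens, {"anchor", "link"}, window=3))
```

Terms occurring within the window of "anchor" or "link" (e.g. "broken", "text") receive positive scores, while distant terms (e.g. "missing") are dropped, so different broken links on the same page naturally yield different query terms.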