Crawler by Inference

2020 Indo – Taiwan 2nd International Conference on Computing, Analytics and Networks (Indo-Taiwan ICAN). Pub Date: 2020-02-01. DOI: 10.1109/Indo-TaiwanICAN48429.2020.9181364

P. Hegade, R. Shilpa, Pratiksha Aigal, Swati Pai, Priyanka Shejekar


Citations: 3


Crawler by Inference

With a million new pages added every day, the already gigantic web is growing exponentially. This growth challenges search engines and traditional information retrieval methods in producing relevant results, and it equally challenges the crawler, which does the background job of traversing the web's hyperlink structure to obtain a snapshot of the web. Traditional crawlers face the challenges of maintaining the right traversal data structure and tracking already-visited pages. Contemporary applications require context- and domain-specific crawlers that harvest the right set of pages and data. A focused crawler needs domain-specific evaluation parameters to evaluate and crawl the right set of pages based on relevance. In this paper, we propose a novel model, Crawler by Inference, to achieve these objectives using semantic similarity, paradigmatic similarity, and rules of inference. The proposed methodology prioritizes links based on the number of new rules built or discovered. The model proposes an efficient data structure, an intelligent queue, which holds the links on a priority basis. The resulting analysis data of a page can also act as metadata for the page. The paper also presents results in comparison with a traditional crawler. The model promises to produce better results by avoiding the crawl of irrelevant pages.
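The abstract mentions a queue that holds links on a priority basis, with priority tied to the number of new rules built or discovered. The paper's actual design is not given here, so the following is only a minimal sketch of that general idea using a standard binary heap. The names `PriorityFrontier` and `new_rule_count` are hypothetical, and representing rules as plain hashable values is an assumption made purely for illustration.

```python
import heapq
from itertools import count


class PriorityFrontier:
    """Crawl frontier that pops the highest-scoring unseen link first.

    Illustrative only: this is not the paper's "intelligent queue".
    """

    def __init__(self):
        self._heap = []        # entries: (-score, insertion order, url)
        self._seen = set()     # every URL ever queued, so none is queued twice
        self._order = count()  # tie-breaker: equal scores leave in FIFO order

    def push(self, url, score):
        """Queue url with the given score; return False if it was seen before."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the score is negated to pop the largest first.
        heapq.heappush(self._heap, (-score, next(self._order), url))
        return True

    def pop(self):
        """Remove and return (url, score) for the best queued link."""
        neg_score, _, url = heapq.heappop(self._heap)
        return url, -neg_score

    def __len__(self):
        return len(self._heap)


def new_rule_count(page_rules, known_rules):
    """Hypothetical score: how many of a page's rules are not yet known."""
    return len(set(page_rules) - set(known_rules))
```

In this sketch the set of seen URLs handles the "tracking already-visited pages" concern, while the heap keeps the most promising link at the front. How rules are actually extracted and compared is left open, since the abstract does not specify it.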


Source publication

2020 Indo – Taiwan 2nd International Conference on Computing, Analytics and Networks (Indo-Taiwan ICAN)

Self-citation rate

0.00%

Number of publications