Crawler by Inference
P. Hegade, R. Shilpa, Pratiksha Aigal, Swati Pai, Priyanka Shejekar
2020 Indo – Taiwan 2nd International Conference on Computing, Analytics and Networks (Indo-Taiwan ICAN), February 2020
DOI: 10.1109/Indo-TaiwanICAN48429.2020.9181364
Abstract: With a million new pages added every single day, the already gigantic web is growing exponentially. This growth challenges not only search engines and traditional information retrieval methods in producing relevant results, but also the crawler, which performs the background job of traversing the web's hyperlink structure to obtain a snapshot of the web. Traditional crawlers face the challenges of maintaining the right traversal data structure and of tracking already-visited pages. Contemporary applications require context- and domain-specific crawlers that harvest the right set of pages and data. A focused crawler needs domain-specific evaluation parameters to evaluate and crawl the right set of pages based on relevance. In this paper, we propose a novel model, Crawler by Inference, to achieve these objectives using semantic similarity, paradigmatic similarity, and rules of inference. The proposed methodology prioritizes links based on the number of new rules built or discovered. The model introduces an efficient data structure, an intelligent queue, which holds links on a priority basis. The resulting analysis data of a page can also serve as metadata for that page. The paper also presents results in comparison with a traditional crawler. The model promises better results by avoiding crawling irrelevant pages.
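The abstract describes the intelligent queue only at a high level. Below is a minimal sketch of one way such a priority-based crawl frontier could look: links are ordered by a numeric relevance score, taken here to be the count of new inference rules the source page yielded. The `score_link` function and class name are illustrative assumptions, not the authors' actual implementation.

```python
# Sketch of a priority-based crawl frontier in the spirit of the
# "intelligent queue" from the abstract. Links with higher relevance
# scores are crawled first; a seen-set handles visited-page tracking.
import heapq
import itertools


class IntelligentQueue:
    """Crawl frontier that pops the highest-scoring link first."""

    def __init__(self):
        self._heap = []                    # min-heap of (-score, tiebreak, url)
        self._counter = itertools.count()  # tiebreak: FIFO among equal scores
        self._seen = set()                 # URLs already enqueued or crawled

    def push(self, url: str, score: float) -> None:
        if url in self._seen:              # skip already-seen pages
            return
        self._seen.add(url)
        # Negate the score so Python's min-heap behaves as a max-heap.
        heapq.heappush(self._heap, (-score, next(self._counter), url))

    def pop(self) -> str:
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self) -> int:
        return len(self._heap)


def score_link(new_rules_discovered: int) -> float:
    # Placeholder scoring: the paper prioritizes by the number of new
    # rules built or discovered; its exact formulation is not given here.
    return float(new_rules_discovered)


if __name__ == "__main__":
    frontier = IntelligentQueue()
    frontier.push("https://example.org/a", score_link(5))
    frontier.push("https://example.org/b", score_link(2))
    frontier.push("https://example.org/a", score_link(9))  # duplicate, ignored
    while frontier:
        print(frontier.pop())  # /a (score 5) is crawled before /b (score 2)
```

A binary heap keeps push and pop at O(log n), and the seen-set addresses the visited-page tracking challenge the abstract raises; both choices are assumptions made for this sketch.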