SAFSB: A self-adaptive focused crawler
D. Sharma, Mohd. Aamir Khan
2015 1st International Conference on Next Generation Computing Technologies (NGCT)
Published: 2015-09-01
DOI: 10.1109/NGCT.2015.7375215 (https://doi.org/10.1109/NGCT.2015.7375215)
Citations: 6
Abstract
About 3 billion indexed websites are present on the WWW. Not all websites belonging to a particular topic are indexed by a search engine such as google.com; online platforms exist where users help a person asking for a URL (Uniform Resource Locator) containing topical information. To verify the authenticity and validity of such a URL, this paper presents an empirical methodology, together with a ranking to measure its relevance. To semantically expand the search, a topic ontology is used in the pre-processing stage of the focused crawler to make the search more effective. The performance of our web crawler is further improved by ontology-based learning, which is constantly updated through dictionary-based learning and the related words of named entities. The harvest ratio, which represents the ratio of relevant pages to crawled pages, shows a significant improvement over previous methods.
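
The harvest ratio mentioned in the abstract can be illustrated with a minimal sketch. The function name `harvest_ratio` and the list of per-page relevance flags are assumptions made for illustration; the abstract does not describe how the paper decides whether a page is relevant, so that judgment is taken as given here.

```python
def harvest_ratio(relevance_flags):
    """Return the fraction of crawled pages that were judged relevant.

    relevance_flags: one boolean per crawled page (hypothetical input;
    how relevance is decided is outside the scope of this sketch).
    """
    crawled = len(relevance_flags)
    if crawled == 0:
        # No pages crawled yet: define the ratio as 0.0 to avoid dividing by zero.
        return 0.0
    relevant = sum(1 for flag in relevance_flags if flag)
    return relevant / crawled


# Example: 3 of 4 crawled pages were judged relevant.
print(harvest_ratio([True, False, True, True]))  # 0.75
```

A higher value means the crawler spent less of its crawl budget on off-topic pages, which is the sense in which the abstract reports an improvement.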