san_sim: Factual and efficient URL text similarity algorithm
Sandhya Pundhir, Udayan Ghose
2017 3rd International Conference on Applied and Theoretical Computing and Communication Technology (iCATccT), 2017-12-01
DOI: https://doi.org/10.1109/ICATCCT.2017.8389161
Citations: 1
Abstract
Similarity quantifies the relation between two objects, and we need it to establish an order between the objects being compared. Here we compare two URLs (Uniform Resource Locators) and determine which is more relevant to the input query. Content mining is a web mining technique that uses the text of the web page. Online learning is used where the entire dataset cannot be used at training time because of its size. We implement a few popular text similarity methods and compare their relevance with that of our proposed method. We find that our algorithm performs better than traditional text similarity measures such as LCS (Longest Common Subsequence) and the Dice score: the proposed method achieves higher Precision, Recall and F-measure. This shows that data-specific filtering methods and online learning principles, when used with a statistical method, produce better results.
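For readers unfamiliar with the baselines and metrics named in the abstract, the following is a minimal Python sketch of LCS similarity, the Dice score, and Precision, Recall and F-measure. It does not reproduce the proposed san_sim method, which the abstract does not specify. The function names, the normalization of LCS by the longer string's length, and the use of lowercase whitespace-separated word tokens for Dice are all illustrative assumptions rather than details from the paper.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of a and b (dynamic programming)."""
    prev = [0] * (len(b) + 1)  # DP row for the previous character of a
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_similarity(a: str, b: str) -> float:
    """LCS length normalized by the longer input (assumed normalization)."""
    if not a and not b:
        return 1.0
    return lcs_length(a, b) / max(len(a), len(b))


def dice_score(a: str, b: str) -> float:
    """Dice coefficient over sets of lowercase word tokens (assumed tokenization)."""
    tokens_a = set(a.lower().split())
    tokens_b = set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return 2 * len(tokens_a & tokens_b) / (len(tokens_a) + len(tokens_b))


def precision_recall_f(retrieved, relevant):
    """Precision, recall and balanced F-measure for a set of retrieved items."""
    retrieved, relevant = set(retrieved), set(relevant)
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    query = "text similarity algorithm"       # hypothetical input query
    page_text = "an efficient text similarity method"  # hypothetical page text
    print(lcs_similarity(query, page_text))
    print(dice_score(query, page_text))
    # Hypothetical URL identifiers: two of three retrieved URLs are relevant.
    print(precision_recall_f(["u1", "u2", "u3"], ["u2", "u3", "u4", "u5"]))
```

Under these assumptions, ranking two URLs against a query amounts to scoring each page's text with one of the similarity functions and preferring the higher score; the retrieval metrics then compare the resulting ranking against a labeled set of relevant URLs.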