WebRank: Language-Independent Extraction of Keywords from Webpages
H. Shah, R. Mariescu-Istodor, P. Fränti
2021 IEEE International Conference on Progress in Informatics and Computing (PIC), 2021-12-17
DOI: 10.1109/PIC53636.2021.9687047
Citation count: 0
Abstract
We present a supervised method for keyword extraction from webpages. The method divides the HTML page into meaningful segments using the document object model (DOM) and calculates a language-independent feature vector for each word. From these vectors, we train a classification model that estimates the likelihood that a word is a keyword, and the most likely words are then selected. We analyze the usefulness of the features on different datasets (news articles and service web pages) and compare different classification methods for the task. Results show that random forest performs best, providing an improvement of up to 27.8 percentage points over the best existing method.
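
The feature-extraction step described above might be sketched as follows. This is only an illustration under stated assumptions, not the method from the paper: the abstract does not list the actual segments or features, so the tracked tags, the five features, and the names `SegmentCollector` and `word_features` are all hypothetical. Tokenizing with `\w+` is also a simplification that would not handle scripts written without word separators.

```python
from collections import defaultdict
from html.parser import HTMLParser
import re

# Hypothetical segment types; the paper's actual DOM segmentation is not
# given in the abstract.
TRACKED_TAGS = {"title", "h1", "h2", "a", "p"}
SKIPPED_TAGS = {"script", "style"}


class SegmentCollector(HTMLParser):
    """Collects (tag, text) segments, labeled by the innermost tracked open tag."""

    def __init__(self):
        super().__init__()
        self.stack = []     # currently open tags
        self.segments = []  # list of (tag, text) pairs

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack and self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if not text or any(t in SKIPPED_TAGS for t in self.stack):
            return
        tag = next((t for t in reversed(self.stack) if t in TRACKED_TAGS), "other")
        self.segments.append((tag, text))


def word_features(html):
    """Return {word: feature vector} using only structural, non-linguistic cues.

    Assumed features (illustrative only): relative frequency, appears in title,
    appears in a heading, appears in a link, relative position of first use.
    """
    parser = SegmentCollector()
    parser.feed(html)
    parser.close()

    counts = defaultdict(int)
    tags_seen = defaultdict(set)
    first_pos = {}
    position = 0
    for tag, text in parser.segments:
        for word in re.findall(r"\w+", text.lower()):
            counts[word] += 1
            tags_seen[word].add(tag)
            first_pos.setdefault(word, position)
            position += 1

    total = max(position, 1)
    return {
        w: [
            counts[w] / total,
            float("title" in tags_seen[w]),
            float(bool(tags_seen[w] & {"h1", "h2"})),
            float("a" in tags_seen[w]),
            first_pos[w] / total,
        ]
        for w in counts
    }
```

Vectors of this kind, paired with keyword/non-keyword labels, could then be passed to any probabilistic classifier. With scikit-learn, for example, one might fit a `RandomForestClassifier` and rank the words of a page by the output of `predict_proba`.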