WebRank: Language-Independent Extraction of Keywords from Webpages
H. Shah, R. Mariescu-Istodor, P. Fränti
2021 IEEE International Conference on Progress in Informatics and Computing (PIC), published 2021-12-17
DOI: https://doi.org/10.1109/PIC53636.2021.9687047
We present a supervised method for extracting keywords from webpages. The method divides the HTML page into meaningful segments using the document object model (DOM) and calculates a language-independent feature vector for each word. From these vectors, we train a classification model that gives the likelihood of a word being a keyword, and the most likely words are then selected. We analyze the usefulness of the features on different datasets (news articles and service web pages) and compare different classification methods for the task. Results show that random forest performs best and provides an improvement of up to 27.8 percentage points over the best existing method.
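The first two steps of the abstract (DOM-based segmentation, then a per-word feature vector) can be illustrated with a minimal sketch using only the Python standard library. Everything specific here is an assumption made for illustration, not the paper's actual design: the tag-to-segment mapping `SEGMENT_TAGS`, the function `word_features`, and the five features it computes are all hypothetical. The features avoid stop-word lists, stemmers, and dictionaries, which is one plausible reading of "language independent".

```python
import re
from collections import Counter, defaultdict
from html.parser import HTMLParser

# Hypothetical mapping from HTML tags to segment names (an assumption;
# the paper's real segmentation is not given in the abstract).
SEGMENT_TAGS = {"title": "title", "h1": "heading", "h2": "heading",
                "h3": "heading", "a": "link", "p": "body"}
SKIP_TAGS = {"script", "style"}  # non-visible text is ignored


class SegmentParser(HTMLParser):
    """Collects words per segment, keyed by the innermost known tag."""

    def __init__(self):
        super().__init__()
        self.stack = []                 # currently open tags, innermost last
        self.words = defaultdict(list)  # segment name -> list of words

    def handle_starttag(self, tag, attrs):
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed tags.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        if any(tag in SKIP_TAGS for tag in self.stack):
            return
        segment = "other"
        for tag in reversed(self.stack):
            if tag in SEGMENT_TAGS:
                segment = SEGMENT_TAGS[tag]
                break
        # \w+ is Unicode-aware, so no language-specific tokenizer is used.
        self.words[segment].extend(re.findall(r"\w+", data.lower()))


def word_features(html):
    """Return {word: feature dict} for every word found in the page."""
    parser = SegmentParser()
    parser.feed(html)
    parser.close()
    counts = {seg: Counter(ws) for seg, ws in parser.words.items()}
    total = Counter()
    for seg_counts in counts.values():
        total.update(seg_counts)
    n_words = sum(total.values()) or 1
    return {
        word: {
            "rel_freq": freq / n_words,
            "in_title": int(word in counts.get("title", ())),
            "in_heading": int(word in counts.get("heading", ())),
            "in_link": int(word in counts.get("link", ())),
            "length": len(word),
        }
        for word, freq in total.items()
    }


page = ("<html><head><title>Pizza Helsinki</title>"
        "<script>var x = 1;</script></head>"
        "<body><h1>Best pizza</h1>"
        "<p>Fresh pizza daily. <a href='/m'>Menu</a></p></body></html>")
features = word_features(page)
```

In a supervised setup like the one the abstract describes, vectors of this kind, paired with ground-truth keyword labels, would be used to train a classifier (the abstract reports random forest as the best performer), and the words with the highest predicted likelihood would be kept as keywords.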