{"title":"Modeling Website Topic Cohesion at Scale to Improve Webpage Classification","authors":"D. Eswaran, Paul N. Bennett, Joseph J. Pfeiffer","doi":"10.1145/2766462.2767834","DOIUrl":null,"url":null,"abstract":"Considerable work in web page classification has focused on incorporating the topical structure of the web (e.g., the hyperlink graph) to improve prediction accuracy. However, the majority of work has primarily focused on relational or graph-based methods that are impractical to run at scale or in an online environment. This raises the question of whether it is possible to leverage the topical structure of the web while incurring nearly no additional prediction-time cost. To this end, we introduce an approach which adjusts a page content-only classification from that obtained with a global prior to the posterior obtained by incorporating a prior which reflects the topic cohesion of the site. Using ODP data, we empirically demonstrate that our approach yields significant performance increases over a range of topics.","PeriodicalId":297035,"journal":{"name":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2015-08-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2766462.2767834","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
Considerable work in web page classification has focused on incorporating the topical structure of the web (e.g., the hyperlink graph) to improve prediction accuracy. However, the majority of work has primarily focused on relational or graph-based methods that are impractical to run at scale or in an online environment. This raises the question of whether it is possible to leverage the topical structure of the web while incurring nearly no additional prediction-time cost. To this end, we introduce an approach which adjusts a page content-only classification from that obtained with a global prior to the posterior obtained by incorporating a prior which reflects the topic cohesion of the site. Using ODP data, we empirically demonstrate that our approach yields significant performance increases over a range of topics.