A New Centroid-based Approach for Genre Categorization of Web Pages
Chaker Jebari
Journal for Language Technology and Computational Linguistics, published 2009-07-01
DOI: 10.21248/jlcl.24.2009.114 (https://doi.org/10.21248/jlcl.24.2009.114)
Citation count: 4
Abstract
In this paper we propose a new centroid-based approach for genre categorization of web pages. Our approach constructs genre centroids from a set of genre-labeled web pages, called training web pages. The resulting centroids are then used to classify new web pages. The aim of our approach is to provide a flexible, incremental, refined and combined categorization, which is more suitable for automatic web genre identification. Our approach is flexible because it assigns a web page to all predefined genres, each with a confidence score; it is incremental because it classifies web pages one by one; it is refined because each web page either refines the centroids or is discarded as a noisy page; finally, it is combined because it draws on three different feature sets, i.e. URL addresses, logical structure and hypertext structure. Experiments conducted on two known corpora show that our approach is very fast and outperforms other approaches.
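The workflow summarized in the abstract can be illustrated with a minimal Python sketch. It is not the paper's implementation. The abstract does not state the similarity measure, the centroid update rule, the noise criterion, or how the three feature sets are combined, so the sketch assumes cosine similarity, a running-mean centroid, a fixed hypothetical `noise_threshold`, and a single merged feature dictionary with hypothetical keys such as `url:blog`. The class name `GenreCentroids` and all method names are likewise illustrative.

```python
import math
from collections import defaultdict


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * b.get(f, 0.0) for f, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


class GenreCentroids:
    """Illustrative centroid classifier; every design choice is an assumption."""

    def __init__(self, noise_threshold=0.2):
        self.noise_threshold = noise_threshold  # hypothetical cut-off
        self.sums = {}    # genre -> summed feature vector
        self.counts = {}  # genre -> number of pages absorbed

    def _absorb(self, genre, features):
        # Add one page to a genre's running sum (centroid = sum / count).
        total = self.sums.setdefault(genre, defaultdict(float))
        for f, w in features.items():
            total[f] += w
        self.counts[genre] = self.counts.get(genre, 0) + 1

    def train(self, labeled_pages):
        # Build the initial centroids from (features, genre) pairs.
        for features, genre in labeled_pages:
            self._absorb(genre, features)

    def centroid(self, genre):
        n = self.counts[genre]
        return {f: w / n for f, w in self.sums[genre].items()}

    def classify(self, features):
        # Flexible: return a score for every genre, not a single label.
        scores = {g: cosine(features, self.centroid(g)) for g in self.sums}
        best = max(scores, key=scores.get)
        # Refined: a confident page updates its best centroid;
        # otherwise it is treated as noise and left out.
        refined = scores[best] >= self.noise_threshold
        if refined:
            self._absorb(best, features)
        return scores, refined


model = GenreCentroids(noise_threshold=0.2)
model.train([
    ({"url:blog": 1.0, "tag:h1": 1.0}, "blog"),
    ({"url:shop": 1.0, "links:out": 2.0}, "eshop"),
])
# Incremental: pages are classified one at a time.
scores, refined = model.classify({"url:blog": 1.0})
```

Here `classify` returns a score per genre and a flag saying whether the page was used to refine a centroid. A page that shares no features with any centroid scores zero everywhere and is discarded under the assumed threshold.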