A new term weighting scheme based on class specific document frequency for document representation and classification
Suthira Plansangket, J. Q. Gan
2015 7th Computer Science and Electronic Engineering Conference (CEEC), 2015-11-23
DOI: 10.1109/CEEC.2015.7332690 (https://doi.org/10.1109/CEEC.2015.7332690)
Citations: 9
Abstract
Document classification is usually more challenging than numerical data classification, because documents are much harder than numerical data to represent effectively for classification purposes. The vector space model (VSM), in which a document is represented by a vector of feature values based on a bag of words, has been widely used for document representation in classification. This paper proposes a new feature for document representation under the VSM framework, class specific document frequency (CSDF), which leads to a novel term weighting scheme based on term frequency (TF), term presence (TP), and the newly proposed feature. The experimental results show that the proposed features, CSDF and TF-CSDF, improve document classification performance in comparison with other widely used VSM document representations.
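As a rough illustration of the VSM features named in the abstract, the following Python sketch builds term frequency (TF) and term presence (TP) vectors over a bag-of-words vocabulary. The abstract does not state how CSDF is computed, so `class_doc_freq` below uses an assumed stand-in definition (the fraction of a class's documents that contain a term); it should not be read as the paper's formula. The tokenizer, the toy corpus, and all function names are likewise hypothetical.

```python
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer (a simplification for illustration)."""
    return text.lower().split()


def build_vocab(docs):
    """Sorted list of all distinct terms in the corpus."""
    return sorted({term for doc in docs for term in tokenize(doc)})


def tf_vector(doc, vocab):
    """Term frequency: raw count of each vocabulary term in the document."""
    counts = Counter(tokenize(doc))
    return [counts[term] for term in vocab]


def tp_vector(doc, vocab):
    """Term presence: 1 if the term occurs in the document, else 0."""
    present = set(tokenize(doc))
    return [1 if term in present else 0 for term in vocab]


def class_doc_freq(docs, labels, vocab):
    """ASSUMED definition, not taken from the paper: for each class, the
    fraction of that class's documents that contain each vocabulary term."""
    result = {}
    for cls in sorted(set(labels)):
        class_docs = [set(tokenize(d)) for d, lab in zip(docs, labels) if lab == cls]
        result[cls] = [
            sum(term in d for d in class_docs) / len(class_docs) for term in vocab
        ]
    return result


# Hypothetical toy corpus with two classes.
docs = [
    "cheap pills buy now",
    "buy cheap watches",
    "meeting agenda for monday",
    "monday project meeting",
]
labels = ["spam", "spam", "ham", "ham"]

vocab = build_vocab(docs)
cdf = class_doc_freq(docs, labels, vocab)

print(vocab)
print(tf_vector("buy buy cheap", vocab))
print(tp_vector("buy buy cheap", vocab))
print(cdf["spam"][vocab.index("buy")], cdf["ham"][vocab.index("buy")])
```

In this toy corpus, "buy" appears in every spam document and in no ham document, so a class-conditioned document frequency separates the two classes on that term in a way a corpus-wide document frequency would not. How the paper combines such a feature with TF or TP to form TF-CSDF is not specified in the abstract.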