Computing Correlative Association of Terms for Automatic Classification of Text Documents
Deepak Agnihotri, K. Verma, Priyanka Tripathi
Proceedings of the Third International Symposium on Computer Vision and the Internet, 2016-09-21
DOI: 10.1145/2983402.2983424 (https://doi.org/10.1145/2983402.2983424)
Citations: 11
Abstract
Selecting the most informative terms reduces the feature set and speeds up classification. Which terms are most informative depends strongly on the correlative association among terms. Rare terms are more informative than sparse and common terms. The main objective of this study is to assign a higher weight to rare terms and a lower weight to common and sparse terms. Term weights are computed with emphasis on term strength, mutual information, and strong association with a specific class. To this end, we propose a novel hybrid feature selection method, the Correlative Association Score (CAS) of terms. CAS uses the concept of the Apriori algorithm to select the most informative terms. First, CAS selects the most informative terms from all extracted terms. Next, N-grams in the range (1,3) are generated from these informative terms. Finally, the standard Chi-Square (χ²) method is applied to select the most informative N-grams. Two standard classifiers, Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM), are applied to four standard text data sets: Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23. Extensive experiments show promising results, demonstrating the effectiveness of CAS compared with the state-of-the-art methods Mutual Information (MI), Information Gain (IG), Discriminating Feature Selection (DFS), and χ².
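The last two stages described above (N-gram generation in the range (1,3), followed by χ² scoring) can be illustrated with a minimal sketch. The abstract does not give the CAS formula, so the CAS term-selection step is not reproduced here. The function names `ngrams` and `chi_square`, the toy documents, and the use of document-level presence/absence counts are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter


def ngrams(tokens, n_min=1, n_max=3):
    """Return all contiguous n-grams with n_min <= n <= n_max as space-joined strings."""
    out = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            out.append(" ".join(tokens[i:i + n]))
    return out


def chi_square(docs, labels, target):
    """Score each n-gram against class `target` with the 2x2 chi-square statistic.

    Assumption: counts are document-level presence/absence, not term frequencies.
    """
    n_docs = len(docs)
    n_pos = sum(1 for y in labels if y == target)
    df_all, df_pos = Counter(), Counter()
    for tokens, y in zip(docs, labels):
        present = set(ngrams(tokens))
        df_all.update(present)
        if y == target:
            df_pos.update(present)
    scores = {}
    for gram, df in df_all.items():
        a = df_pos[gram]          # target-class docs containing the n-gram
        b = df - a                # other-class docs containing it
        c = n_pos - a             # target-class docs lacking it
        d = n_docs - n_pos - b    # other-class docs lacking it
        denom = (a + b) * (c + d) * (a + c) * (b + d)
        scores[gram] = n_docs * (a * d - b * c) ** 2 / denom if denom else 0.0
    return scores


# Hypothetical toy corpus: tokenized documents and their class labels.
docs = [["cheap", "pills"], ["cheap", "offer"],
        ["meeting", "notes"], ["project", "meeting"]]
labels = ["spam", "spam", "ham", "ham"]

scores = chi_square(docs, labels, "spam")
top = sorted(scores, key=scores.get, reverse=True)[:5]
```

Because χ² measures dependence in either direction, an n-gram that appears only outside the target class scores as highly as one that appears only inside it. A selection step would typically keep the top-k n-grams by score before training a classifier such as MNB or a linear SVM.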