Computing Correlative Association of Terms for Automatic Classification of Text Documents

Deepak Agnihotri, K. Verma, Priyanka Tripathi
{"title":"Computing Correlative Association of Terms for Automatic Classification of Text Documents","authors":"Deepak Agnihotri, K. Verma, Priyanka Tripathi","doi":"10.1145/2983402.2983424","DOIUrl":null,"url":null,"abstract":"The selection of most informative terms reduces the feature set and speed up the classification process. The most informative terms are highly affected by the correlative association of the terms. The rare terms are most informative than sparse and common terms. The main objective of this study is assigning a higher weight to the rare terms and less weight to the common and sparse terms. The terms weight are computed by giving emphasis on terms- strength, mutual information and strong association with the specific class. In this context, we proposed, a novel hybrid feature selection method named as, Correlative Association Score (CAS) of terms. The CAS utilizes the concept of Apriori algorithm to select the most informative terms. Initially, the CAS select most informative terms from the entire extracted terms. Subsequently, the N-grams of range (1,3) are generated from these informative terms. Finally, the standard Chi Square (χ2) method is applied to select most informative N-grams. The two standard classifiers Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM) are applied on four standard text data sets Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23. The promising results of extensive experiments demonstrate the effectiveness of the CAS in compared to state-of-the-art methods viz. 
Mutual Information (MI), Information Gain (IG), Discriminating Feature Selection (DFS), and χ2.","PeriodicalId":283626,"journal":{"name":"Proceedings of the Third International Symposium on Computer Vision and the Internet","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Third International Symposium on Computer Vision and the Internet","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2983402.2983424","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The selection of the most informative terms reduces the feature set and speeds up the classification process. The most informative terms are strongly affected by the correlative association of terms, and rare terms are more informative than sparse and common terms. The main objective of this study is to assign a higher weight to rare terms and a lower weight to common and sparse terms. Term weights are computed with emphasis on term strength, mutual information, and strong association with a specific class. In this context, we propose a novel hybrid feature selection method named the Correlative Association Score (CAS) of terms. CAS uses the concept of the Apriori algorithm to select the most informative terms. Initially, CAS selects the most informative terms from the full set of extracted terms. Subsequently, N-grams in the range (1, 3) are generated from these informative terms. Finally, the standard Chi-square (χ2) method is applied to select the most informative N-grams. Two standard classifiers, Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM), are applied to four standard text data sets: WebKB, 20Newsgroups, Ohsumed10, and Ohsumed23. The promising results of extensive experiments demonstrate the effectiveness of CAS compared to state-of-the-art methods, viz. Mutual Information (MI), Information Gain (IG), Discriminating Feature Selection (DFS), and χ2.
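The later stages of the pipeline described above — (1, 3) N-gram generation, Chi-square selection of the most informative N-grams, and classification with Multinomial Naive Bayes — can be sketched with standard scikit-learn components. This is a minimal illustration, not the authors' implementation: the CAS/Apriori term-selection stage itself is omitted, and the toy corpus, labels, and the selection parameter `k=20` are assumptions chosen only to make the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus (an assumption; the paper uses WebKB,
# 20Newsgroups, Ohsumed10, and Ohsumed23).
docs = [
    "the patient shows symptoms of heart disease",
    "cardiac arrest treatment in the hospital",
    "the football team won the championship game",
    "a great match between two football teams",
]
labels = [0, 0, 1, 1]  # 0 = medical, 1 = sports

pipeline = Pipeline([
    # N-grams in the range (1, 3), as in the abstract.
    ("vect", CountVectorizer(ngram_range=(1, 3))),
    # Chi-square selection of the top-k N-grams; k=20 is illustrative.
    ("chi2", SelectKBest(chi2, k=20)),
    # One of the two classifiers used in the paper (MNB).
    ("clf", MultinomialNB()),
])
pipeline.fit(docs, labels)

print(pipeline.predict(["football match on sunday"]))  # → [1] (the sports class)
```

The paper's second classifier, LSVM, drops in by replacing `MultinomialNB()` with `sklearn.svm.LinearSVC()` in the final pipeline step; the vectorization and Chi-square stages stay the same.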