Computing Correlative Association of Terms for Automatic Classification of Text Documents

Deepak Agnihotri, K. Verma, Priyanka Tripathi
{"title":"Computing Correlative Association of Terms for Automatic Classification of Text Documents","authors":"Deepak Agnihotri, K. Verma, Priyanka Tripathi","doi":"10.1145/2983402.2983424","DOIUrl":null,"url":null,"abstract":"The selection of most informative terms reduces the feature set and speed up the classification process. The most informative terms are highly affected by the correlative association of the terms. The rare terms are most informative than sparse and common terms. The main objective of this study is assigning a higher weight to the rare terms and less weight to the common and sparse terms. The terms weight are computed by giving emphasis on terms- strength, mutual information and strong association with the specific class. In this context, we proposed, a novel hybrid feature selection method named as, Correlative Association Score (CAS) of terms. The CAS utilizes the concept of Apriori algorithm to select the most informative terms. Initially, the CAS select most informative terms from the entire extracted terms. Subsequently, the N-grams of range (1,3) are generated from these informative terms. Finally, the standard Chi Square (χ2) method is applied to select most informative N-grams. The two standard classifiers Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM) are applied on four standard text data sets Webkb, 20Newsgroup, Ohsumed10, and Ohsumed23. The promising results of extensive experiments demonstrate the effectiveness of the CAS in compared to state-of-the-art methods viz. 
Mutual Information (MI), Information Gain (IG), Discriminating Feature Selection (DFS), and χ2.","PeriodicalId":283626,"journal":{"name":"Proceedings of the Third International Symposium on Computer Vision and the Internet","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"11","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the Third International Symposium on Computer Vision and the Internet","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2983402.2983424","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

The selection of the most informative terms reduces the feature set and speeds up the classification process. The most informative terms are strongly affected by the correlative association of terms, and rare terms are more informative than sparse and common terms. The main objective of this study is to assign a higher weight to rare terms and a lower weight to common and sparse terms. Term weights are computed with emphasis on term strength, mutual information, and strong association with a specific class. In this context, we propose a novel hybrid feature selection method named the Correlative Association Score (CAS) of terms. CAS uses the concept of the Apriori algorithm to select the most informative terms. Initially, CAS selects the most informative terms from the full set of extracted terms. Subsequently, N-grams in the range (1, 3) are generated from these informative terms. Finally, the standard Chi-square (χ2) method is applied to select the most informative N-grams. Two standard classifiers, Multinomial Naive Bayes (MNB) and Linear Support Vector Machine (LSVM), are applied to four standard text data sets: WebKB, 20Newsgroups, Ohsumed10, and Ohsumed23. The promising results of extensive experiments demonstrate the effectiveness of CAS compared to state-of-the-art methods, viz. Mutual Information (MI), Information Gain (IG), Discriminating Feature Selection (DFS), and χ2.
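The later stages of the pipeline described above — (1, 3) N-gram generation, Chi-square selection of the most informative N-grams, and classification with Multinomial Naive Bayes — can be sketched with standard scikit-learn components. This is a minimal illustration, not the authors' implementation: the CAS/Apriori term-selection stage itself is omitted, and the toy corpus, labels, and the selection parameter `k=20` are assumptions chosen only to make the example self-contained.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny illustrative corpus (an assumption; the paper uses WebKB,
# 20Newsgroups, Ohsumed10, and Ohsumed23).
docs = [
    "the patient shows symptoms of heart disease",
    "cardiac arrest treatment in the hospital",
    "the football team won the championship game",
    "a great match between two football teams",
]
labels = [0, 0, 1, 1]  # 0 = medical, 1 = sports

pipeline = Pipeline([
    # N-grams in the range (1, 3), as in the abstract.
    ("vect", CountVectorizer(ngram_range=(1, 3))),
    # Chi-square selection of the top-k N-grams; k=20 is illustrative.
    ("chi2", SelectKBest(chi2, k=20)),
    # One of the two classifiers used in the paper (MNB).
    ("clf", MultinomialNB()),
])
pipeline.fit(docs, labels)

print(pipeline.predict(["football match on sunday"]))  # → [1] (the sports class)
```

The paper's second classifier, LSVM, drops in by replacing `MultinomialNB()` with `sklearn.svm.LinearSVC()` in the final pipeline step; the vectorization and Chi-square stages stay the same.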