Feature selection methods for document clustering: a comparative study and a hybrid solution

JCR quartile: Q4 (Mathematics)
Asmaa Benghabrit, B. Ouhbi, B. Frikh, E. Zemmouri, Hicham Behja
Journal: International Journal of Data Analysis Techniques and Strategies, volume "50 1" (as listed), pages 246-272
Publication date: 2019-07-09
Publication type: Journal Article
DOI: 10.1504/IJDATS.2019.10022545 (https://doi.org/10.1504/IJDATS.2019.10022545)
Open access: no
Citations: 1

Abstract

The proliferation of the web makes exploring and using the huge volume of available unstructured text documents challenging, which drives the need for document clustering. Improving the performance of clustering through feature selection therefore seems worth investigating. This paper proposes an efficient way to benefit from feature selection for document clustering. We first present a review and comparative study of feature selection methods in order to identify the efficient ones. We then propose sequential and hybrid modes for combining statistical and semantic techniques, so as to exploit the crucial information that each provides for document clustering. Extensive experiments demonstrate the benefit of the proposed combination approaches. Document clustering performance is highest when the measures based on the Chi-square statistic and on mutual information are linearly combined. This linear combination avoids the unwanted correlation that the sequential approach creates between the two treatments.
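
The abstract does not give the exact formulas or weights, so the following is only a minimal sketch of what a linear combination of a Chi-square score and a mutual-information score could look like. Several things here are assumptions rather than the authors' method: provisional class labels are taken as given (for example from an initial clustering pass), documents are reduced to sets of terms, each term's score is the maximum over classes, both scores are min-max normalised before mixing, and the function names and the weight `alpha` are hypothetical.

```python
import math


def contingency(docs, labels, term, cls):
    """2x2 counts: term present/absent versus document in/out of cls."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term, in_cls = term in tokens, label == cls
        if has_term and in_cls:
            a += 1      # term present, in class
        elif has_term:
            b += 1      # term present, other class
        elif in_cls:
            c += 1      # term absent, in class
        else:
            d += 1      # term absent, other class
    return a, b, c, d


def chi_square(a, b, c, d):
    """Chi-square statistic of the 2x2 table (0.0 if a marginal is empty)."""
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def mutual_information(a, b, c, d):
    """Mutual information (nats) between term presence and class membership."""
    n = a + b + c + d
    cells = (
        (a, a + b, a + c),
        (b, a + b, b + d),
        (c, c + d, a + c),
        (d, c + d, b + d),
    )
    return sum(
        (joint / n) * math.log(joint * n / (term_marg * cls_marg))
        for joint, term_marg, cls_marg in cells
        if joint
    )


def term_scores(docs, labels, measure):
    """Score every vocabulary term by its best value over all classes."""
    vocab = set().union(*docs)
    classes = set(labels)
    return {
        t: max(measure(*contingency(docs, labels, t, c)) for c in classes)
        for t in vocab
    }


def min_max(scores):
    """Rescale scores to [0, 1] so the two measures are comparable."""
    lo, hi = min(scores.values()), max(scores.values())
    span = hi - lo
    return {t: (s - lo) / span if span else 0.0 for t, s in scores.items()}


def hybrid_select(docs, labels, k, alpha=0.5):
    """Top-k terms by alpha * chi2 + (1 - alpha) * MI (alpha is assumed)."""
    chi = min_max(term_scores(docs, labels, chi_square))
    mi = min_max(term_scores(docs, labels, mutual_information))
    combined = {t: alpha * chi[t] + (1 - alpha) * mi[t] for t in chi}
    # Ties are broken alphabetically to keep the result deterministic.
    return sorted(combined, key=lambda t: (-combined[t], t))[:k]


# Toy data: four documents as term sets, with assumed provisional labels.
docs = [
    {"ball", "goal", "the"},
    {"ball", "team", "the"},
    {"vote", "law", "the"},
    {"vote", "party", "the"},
]
labels = [0, 0, 1, 1]
selected = hybrid_select(docs, labels, k=2)
```

In this toy example a term that occurs in every document, such as "the", scores zero on both measures and is ranked last, while terms confined to one class rank first. A sequential variant would instead filter with one measure and re-rank the survivors with the other, which makes the second ranking depend on the first filter's cut-off.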
Source journal
International Journal of Data Analysis Techniques and Strategies (Decision Sciences: Information Systems and Management)
CiteScore: 1.20
Self-citation rate: 0.00%
Articles published: 21