Feature selection methods for document clustering: a comparative study and a hybrid solution

Q4 Mathematics

International Journal of Data Analysis Techniques and Strategies Pub Date : 2019-07-09 DOI:10.1504/IJDATS.2019.10022545

Asmaa Benghabrit, B. Ouhbi, B. Frikh, E. Zemmouri, Hicham Behja

Pages: 246-272

Citation count: 1

Abstract

The proliferation of the web makes it challenging to explore and use the huge amount of available unstructured text documents, which drives the need for document clustering. Improving the performance of clustering through feature selection therefore seems worth investigating. This paper proposes an efficient way to benefit from feature selection for document clustering. We first present a review and comparative study of feature selection methods in order to identify the efficient ones. We then propose sequential and hybrid modes of combining statistical and semantic techniques, in order to benefit from the crucial information that each provides for document clustering. Extensive experiments demonstrate the benefit of the proposed combination approaches. Document clustering performance is highest when the measures based on the Chi-square statistic and mutual information are linearly combined. This linear combination avoids the unwanted correlation that the sequential approach creates between the two treatments.
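The abstract gives no formulas, so the following is only a minimal sketch of what a linear combination of a Chi-square score and a mutual information score could look like, not the authors' implementation. Several things here are assumptions: the group labels (for clustering they might be provisional cluster ids), the min-max normalization, the weight `alpha`, and the hypothetical function names `chi2_and_mi`, `min_max` and `linear_combination`.

```python
import math
from collections import Counter


def chi2_and_mi(docs, labels):
    """Score each term by chi-square and mutual information against labels.

    docs   : list of token lists
    labels : list of group ids (assumption: e.g. provisional cluster ids)
    Returns {term: (chi2, mi)} computed from term presence/absence.
    """
    n = len(docs)
    doc_sets = [set(doc) for doc in docs]
    label_counts = Counter(labels)
    vocab = sorted(set().union(*doc_sets))
    scores = {}
    for term in vocab:
        # Observed counts of (term present?, label) pairs.
        observed = Counter((term in ds, lab) for ds, lab in zip(doc_sets, labels))
        present = sum(observed[(True, lab)] for lab in label_counts)
        row_totals = {True: present, False: n - present}
        chi2 = mi = 0.0
        for is_present, row in row_totals.items():
            for lab, col in label_counts.items():
                expected = row * col / n
                obs = observed[(is_present, lab)]
                if expected > 0:
                    chi2 += (obs - expected) ** 2 / expected
                if obs > 0:
                    mi += (obs / n) * math.log(obs * n / (row * col))
        scores[term] = (chi2, mi)
    return scores


def min_max(values):
    """Rescale values to [0, 1]; all zeros if they are constant."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def linear_combination(scores, alpha=0.5):
    """Rank terms by alpha * chi2 + (1 - alpha) * mi on normalized scores."""
    terms = list(scores)
    chi = min_max([scores[t][0] for t in terms])
    mi = min_max([scores[t][1] for t in terms])
    combined = {t: alpha * c + (1 - alpha) * m for t, c, m in zip(terms, chi, mi)}
    return sorted(combined.items(), key=lambda kv: kv[1], reverse=True)


# Toy corpus (made up for illustration): two groups of two documents each.
docs = [
    ["goal", "match", "team"],
    ["team", "goal", "win"],
    ["stock", "market", "team"],
    ["market", "stock", "price"],
]
labels = [0, 0, 1, 1]

ranked = linear_combination(chi2_and_mi(docs, labels), alpha=0.5)
for term, score in ranked:
    print(f"{term}: {score:.3f}")
```

In this toy corpus the terms that occur in exactly one group ("goal", "market", "stock") rank first. Because both scores are computed independently and then summed, neither one filters the vocabulary before the other is applied, which is the contrast with a sequential scheme that the abstract draws.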

