Investigation of Latent Semantic Analysis for Clustering of Czech News Articles

2014 25th International Workshop on Database and Expert Systems Applications
Pub Date: 2014-12-04
DOI: 10.1109/DEXA.2014.54

Michal Rott, P. Cerva

Citations: 4

Abstract

This paper studies the use of Latent Semantic Analysis (LSA) for automatic clustering of Czech news articles. We show that LSA is capable of yielding good results in this task, as it allows us to reduce the problem of synonymy. This is a particularly important factor for Czech, which belongs to a group of highly inflective and morphologically rich languages. The experimental evaluation of our clustering scheme and the investigation of LSA are performed on query-based and category-based test sets. The results demonstrate that the automatic system yields Rand index values that are 20% lower, in absolute terms, than the accuracy of human cluster annotations. We also show which similarity metric should be used for cluster merging, and the effect of dimension reduction on clustering accuracy.
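For readers unfamiliar with the evaluation metric named in the abstract, the following is a minimal sketch of the standard Rand index: the fraction of document pairs on which two clusterings agree (both place the pair in the same cluster, or both separate it). The function name `rand_index` and the toy labels are illustrative assumptions and are not taken from the paper, which may use a different implementation or variant.

```python
from itertools import combinations


def rand_index(labels_a, labels_b):
    """Return the Rand index between two clusterings of the same items.

    labels_a and labels_b are equal-length sequences of cluster labels.
    (Hypothetical helper for illustration; not the authors' code.)
    """
    if len(labels_a) != len(labels_b):
        raise ValueError("both clusterings must label the same items")
    if len(labels_a) < 2:
        raise ValueError("need at least two items to form a pair")

    agreements = 0
    total_pairs = 0
    for i, j in combinations(range(len(labels_a)), 2):
        same_in_a = labels_a[i] == labels_a[j]
        same_in_b = labels_b[i] == labels_b[j]
        # A pair counts as an agreement when both clusterings make
        # the same decision about it (together or apart).
        if same_in_a == same_in_b:
            agreements += 1
        total_pairs += 1
    return agreements / total_pairs


# Toy example: four articles, a reference annotation and a system output.
reference = [0, 0, 1, 1]
system = [0, 0, 1, 2]
score = rand_index(reference, system)  # 5 of the 6 pairs agree
```

Under this definition, a 20% absolute gap means the system's score is 0.20 lower than the human score on the same 0-to-1 scale, not 20% of the human score.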
