Evolving document features for Web document clustering: a feasibility study

Proceedings of the 2004 Congress on Evolutionary Computation (IEEE Cat. No.04TH8753) Pub Date : 2004-06-19 DOI:10.1109/CEC.2004.1330955

M. P. Sinka, D. Corne

Citations: 7

Abstract

Document analysis and its associated research underpin Web intelligence and the envisaged 'semantic Web'. A key issue is how to encode a document without losing salient information. Current research almost always uses fixed-length vectors based on word (term) frequency (TF) and/or variants thereof. We explore the question of alternative encodings, and we search for such encodings using an evolutionary algorithm (EA). These alternatives consider a variety of other features that can be extracted from a document, and the EA explores the space of weighted combinations of these features. In tests on the BankSearch dataset, the EA found encodings that outperformed previous results obtained with TF-based encodings. Among several tentative findings, it seems clear that the ideal encoding is highly task-dependent, and we can recommend certain features as useful for specific types of document clustering tasks.
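The idea of searching over weighted combinations of document features can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the function names (`tf_vector`, `encode`, `fitness`, `evolve`), the toy feature values, the mutation settings, and in particular the fitness measure, which here is a simple nearest-neighbour label agreement standing in for whatever clustering-quality score the authors used.

```python
import random
from collections import Counter


def tf_vector(text, vocab):
    """Fixed-length term-frequency vector over a given vocabulary."""
    counts = Counter(text.lower().split())
    total = sum(counts.values()) or 1  # avoid division by zero on empty text
    return [counts[w] / total for w in vocab]


def encode(doc_features, weights):
    """Weighted encoding: scale each feature by its evolved weight."""
    return [w * f for w, f in zip(weights, doc_features)]


def fitness(weights, docs, labels):
    """Toy stand-in for clustering quality (hypothetical, not the paper's measure):
    fraction of documents whose nearest neighbour shares their label."""
    encoded = [encode(d, weights) for d in docs]
    hits = 0
    for i, x in enumerate(encoded):
        nearest = min(
            (j for j in range(len(encoded)) if j != i),
            key=lambda j: sum((a - b) ** 2 for a, b in zip(x, encoded[j])),
        )
        hits += labels[nearest] == labels[i]
    return hits / len(docs)


def evolve(docs, labels, n_features, generations=50, pop_size=10, seed=0):
    """Small elitist EA over weight vectors in [0, 1]^n_features."""
    rng = random.Random(seed)
    pop = [[rng.random() for _ in range(n_features)] for _ in range(pop_size)]
    for _ in range(generations):
        # Keep the better half, refill with Gaussian-mutated copies of survivors.
        pop.sort(key=lambda w: fitness(w, docs, labels), reverse=True)
        parents = pop[: pop_size // 2]
        children = [
            [min(1.0, max(0.0, g + rng.gauss(0, 0.1))) for g in rng.choice(parents)]
            for _ in range(pop_size - len(parents))
        ]
        pop = parents + children
    return max(pop, key=lambda w: fitness(w, docs, labels))


# Hypothetical feature vectors: two informative features plus one noisy one.
docs = [[0.9, 0.1, 0.5], [0.8, 0.2, 0.1], [0.1, 0.9, 0.4], [0.2, 0.8, 0.9]]
labels = ["finance", "finance", "sport", "sport"]
best = evolve(docs, labels, n_features=3)
print("evolved weights:", best, "fitness:", fitness(best, docs, labels))
```

In this sketch a weight near zero effectively switches a feature off, so the search can in principle discard features that do not help a given task, which is one way to read the abstract's remark that the ideal encoding is task-dependent.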

