Evolving document features for Web document clustering: a feasibility study

Proceedings of the 2004 Congress on Evolutionary Computation (IEEE Cat. No.04TH8753). Pub Date: 2004-06-19. DOI: 10.1109/CEC.2004.1330955

M. P. Sinka, D. Corne


Citations: 7

Abstract

Document analysis and its associated research underpin Web intelligence and the envisaged 'semantic Web'. A key issue is how to encode a document without losing salient information. Current research almost always uses fixed-length vectors based on word (term) frequency (TF) and/or variants thereof. We explore the question of alternative encodings, and we search for such encodings using an evolutionary algorithm (EA). These alternatives consider a variety of other features that can be extracted from a document, and the EA explores the space of weighted combinations of these. Tests on the BankSearch dataset found encodings that outperformed previous results obtained with TF-based encodings. Among several tentative findings, it seems clear that the ideal encoding is highly task-dependent, and we can recommend certain features as useful for specific types of document clustering tasks.
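The idea of letting an EA search over weighted combinations of document features can be illustrated with a small sketch. This is not the authors' implementation: the toy two-feature documents, the nearest-neighbour fitness proxy, the elitist selection scheme, and all names (`weighted_distance`, `fitness`, `evolve_weights`) are assumptions made purely for illustration. The abstract does not describe the actual features, operators, or clustering measure.

```python
import random


def weighted_distance(a, b, weights):
    """Squared Euclidean distance with one non-negative weight per feature."""
    return sum(w * (x - y) ** 2 for w, x, y in zip(weights, a, b))


def fitness(weights, docs, labels):
    """Fraction of documents whose nearest neighbour shares their label.

    Assumption: a cheap stand-in for clustering quality, not the paper's measure.
    """
    hits = 0
    for i, doc in enumerate(docs):
        nearest = min(
            (j for j in range(len(docs)) if j != i),
            key=lambda j: weighted_distance(doc, docs[j], weights),
        )
        hits += int(labels[nearest] == labels[i])
    return hits / len(docs)


def evolve_weights(docs, labels, generations=30, pop_size=12, sigma=0.3, seed=0):
    """Elitist EA: keep the best half, refill with Gaussian-mutated copies."""
    rng = random.Random(seed)
    n = len(docs[0])
    # Start from uniform weights plus random candidates.
    population = [[1.0] * n] + [
        [rng.random() for _ in range(n)] for _ in range(pop_size - 1)
    ]
    for _ in range(generations):
        population.sort(key=lambda w: fitness(w, docs, labels), reverse=True)
        parents = population[: pop_size // 2]
        # Each child is a mutated copy of a random parent, clamped at zero.
        children = [
            [max(0.0, w + rng.gauss(0.0, sigma)) for w in rng.choice(parents)]
            for _ in range(pop_size - len(parents))
        ]
        population = parents + children
    return max(population, key=lambda w: fitness(w, docs, labels))


# Hypothetical toy data: feature 0 separates the two topics, feature 1 is
# large-scale noise that misleads an unweighted encoding.
docs = [
    [0.10, 5.0], [0.20, 0.0], [0.15, 2.5],
    [0.90, 4.9], [0.80, 0.1], [0.85, 2.6],
]
labels = [0, 0, 0, 1, 1, 1]

best = evolve_weights(docs, labels)
print(best, fitness(best, docs, labels))
```

Because the best half of the population always survives, the best fitness never decreases from one generation to the next. In this toy setup the search is rewarded for shrinking the weight of the noisy feature, which mirrors the abstract's point that a useful encoding depends on the task.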

