Document clustering with evolved search queries

2017 IEEE Congress on Evolutionary Computation (CEC) | Pub Date: 2017-06-05 | DOI: 10.1109/CEC.2017.7969447

Laurence Hirsch, A. D. Nuovo

Citations: 8

Abstract

A search query defines a set of documents within a collection and can also rank those documents by assigning each one a score according to its closeness to the query in the multidimensional space of weighted terms. In this paper, we describe a system in which an island model genetic algorithm (GA) creates individuals, each of which generates a set of Apache Lucene search queries for text document clustering. Each cluster is specified by the documents returned by a single query in the set. Each document that appears in only one cluster adds to the fitness of the individual, and each document that appears in more than one cluster reduces it. The method can be refined by using the ranking score of each document in the fitness test. The system has a number of advantages; in particular, the final search queries are easily understood and offer a simple explanation of the clusters, so no extra cluster labelling stage is required. We describe how the GA can be used to build queries and show clustering results on various data sets and with different query sizes. Results are also compared with clusters built using the widely used k-means algorithm.
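The fitness rule described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: `run_query` is a hypothetical stand-in for an Apache Lucene search (it matches any document containing at least one query term), the toy `docs` collection is invented for illustration, and the +1 / -1 weights are an assumption, since the abstract states only that uniquely clustered documents add to fitness and multiply clustered documents reduce it. The score-based refinement is not shown.

```python
from collections import Counter
from typing import Dict, List, Set


def run_query(terms: Set[str], docs: Dict[str, str]) -> Set[str]:
    """Hypothetical stand-in for a Lucene search.

    Returns the ids of documents containing at least one query term.
    """
    return {
        doc_id
        for doc_id, text in docs.items()
        if terms & set(text.lower().split())
    }


def fitness(queries: List[Set[str]], docs: Dict[str, str]) -> int:
    """Score one individual, i.e. one set of queries (one query per cluster).

    Assumed weights: +1 for each document returned by exactly one query,
    -1 for each document returned by more than one query.
    """
    hits: Counter = Counter()
    for terms in queries:
        for doc_id in run_query(terms, docs):
            hits[doc_id] += 1
    unique = sum(1 for n in hits.values() if n == 1)
    overlapping = sum(1 for n in hits.values() if n > 1)
    return unique - overlapping


# Toy collection, invented for illustration only.
docs = {
    "d1": "genetic algorithm island model",
    "d2": "lucene search query index",
    "d3": "genetic search query",
    "d4": "weather report",
}

clean = fitness([{"genetic"}, {"lucene"}], docs)        # clusters do not overlap
overlapped = fitness([{"genetic"}, {"search"}], docs)   # d3 falls in both clusters
```

In this toy setting the non-overlapping pair of queries scores higher than the pair that shares `d3`, which is the pressure the GA would use to evolve queries that separate the collection cleanly.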
