Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis

IF 4.6 2区 计算机科学 Q2 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE
Andrew Lensen;Bing Xue;Mengjie Zhang
{"title":"Genetic Programming for Evolving Similarity Functions for Clustering: Representations and Analysis","authors":"Andrew Lensen;Bing Xue;Mengjie Zhang","doi":"10.1162/evco_a_00264","DOIUrl":null,"url":null,"abstract":"<para>Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.</para>","PeriodicalId":50470,"journal":{"name":"Evolutionary Computation","volume":"28 4","pages":"531-561"},"PeriodicalIF":4.6000,"publicationDate":"2020-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1162/evco_a_00264","citationCount":"12","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Evolutionary Computation","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/9277963/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 12

Abstract

Clustering is a difficult and widely studied data mining task, with many varieties of clustering algorithms proposed in the literature. Nearly all algorithms use a similarity measure such as a distance metric (e.g., Euclidean distance) to decide which instances to assign to the same cluster. These similarity measures are generally predefined and cannot be easily tailored to the properties of a particular dataset, which leads to limitations in the quality and the interpretability of the clusters produced. In this article, we propose a new approach to automatically evolving similarity functions for a given clustering algorithm by using genetic programming. We introduce a new genetic programming-based method which automatically selects a small subset of features (feature selection) and then combines them using a variety of functions (feature construction) to produce dynamic and flexible similarity functions that are specifically designed for a given dataset. We demonstrate how the evolved similarity functions can be used to perform clustering using a graph-based representation. The results of a variety of experiments across a range of large, high-dimensional datasets show that the proposed approach can achieve higher and more consistent performance than the benchmark methods. We further extend the proposed approach to automatically produce multiple complementary similarity functions by using a multi-tree approach, which gives further performance improvements. We also analyse the interpretability and structure of the automatically evolved similarity functions to provide insight into how and why they are superior to standard distance metrics.
进化聚类相似函数的遗传规划:表示与分析
聚类是一项困难且研究广泛的数据挖掘任务,文献中提出了多种聚类算法。几乎所有的算法都使用相似性度量,例如距离度量(例如欧几里得距离)来决定将哪些实例分配给同一集群。这些相似性度量通常是预定义的,无法根据特定数据集的属性轻松调整,这导致所产生的聚类的质量和可解释性受到限制。在本文中,我们提出了一种新的方法,通过使用遗传规划来自动进化给定聚类算法的相似性函数。我们介绍了一种新的基于遗传编程的方法,该方法自动选择一小部分特征(特征选择),然后使用各种函数(特征构建)将它们组合起来,以生成专门为给定数据集设计的动态灵活的相似函数。我们展示了如何使用进化的相似性函数来使用基于图的表示进行聚类。在一系列大型高维数据集上进行的各种实验结果表明,与基准方法相比,所提出的方法可以实现更高、更一致的性能。我们进一步扩展了所提出的方法,通过使用多树方法自动生成多个互补相似函数,这进一步提高了性能。我们还分析了自动进化相似性函数的可解释性和结构,以深入了解它们如何以及为什么优于标准距离度量。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
Evolutionary Computation
Evolutionary Computation 工程技术-计算机:理论方法
CiteScore
6.40
自引率
1.50%
发文量
20
审稿时长
3 months
期刊介绍: Evolutionary Computation is a leading journal in its field. It provides an international forum for facilitating and enhancing the exchange of information among researchers involved in both the theoretical and practical aspects of computational systems drawing their inspiration from nature, with particular emphasis on evolutionary models of computation such as genetic algorithms, evolutionary strategies, classifier systems, evolutionary programming, and genetic programming. It welcomes articles from related fields such as swarm intelligence (e.g. Ant Colony Optimization and Particle Swarm Optimization), and other nature-inspired computation paradigms (e.g. Artificial Immune Systems). As well as publishing articles describing theoretical and/or experimental work, the journal also welcomes application-focused papers describing breakthrough results in an application domain or methodological papers where the specificities of the real-world problem led to significant algorithmic improvements that could possibly be generalized to other areas.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信