Xin Yin, Iulian Neamtiu, Saketan Patil, Sean T. Andrews
{"title":"确定性聚类算法中实现诱导的不一致性和不确定性","authors":"Xin Yin, Iulian Neamtiu, Saketan Patil, Sean T. Andrews","doi":"10.1109/icst46399.2020.00032","DOIUrl":null,"url":null,"abstract":"A deterministic clustering algorithm is designed to always produce the same clustering solution on a given input. Therefore, users of clustering implementations (toolkits) naturally assume that implementations of a deterministic clustering algorithm A have deterministic behavior, that is: (1) two different implementations I1 and I2 of A are interchangeable, producing the same clustering on a given input D, and (2) an implementation produces the same clustering solution when run repeatedly on D. We challenge these assumptions. Specifically, we analyze clustering behavior on 528 datasets, three deterministic algorithms (Affinity Propagation, DBSCAN, Hierarchical Agglomerative Clustering) and the deterministic portion of a fourth (K-means), as implemented in various toolkits; in total, we examined 13 algorithm-toolkit combinations. We found that different implementations of deterministic clustering algorithms make different choices, e.g., default parameter settings, noise insertion, input dataset characteristics. As a result, clustering solutions for a fixed algorithm-dataset combination can differ across runs (nondeterminism) and across toolkits (inconsistency). We expose several root causes of such behavior. We show that remedying these root causes improves determinism, increases consistency, and can even improve efficiency. 
Our approach and findings can benefit developers, testers, and users of clustering algorithms.","PeriodicalId":235967,"journal":{"name":"2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Implementation-induced Inconsistency and Nondeterminism in Deterministic Clustering Algorithms\",\"authors\":\"Xin Yin, Iulian Neamtiu, Saketan Patil, Sean T. Andrews\",\"doi\":\"10.1109/icst46399.2020.00032\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"A deterministic clustering algorithm is designed to always produce the same clustering solution on a given input. Therefore, users of clustering implementations (toolkits) naturally assume that implementations of a deterministic clustering algorithm A have deterministic behavior, that is: (1) two different implementations I1 and I2 of A are interchangeable, producing the same clustering on a given input D, and (2) an implementation produces the same clustering solution when run repeatedly on D. We challenge these assumptions. Specifically, we analyze clustering behavior on 528 datasets, three deterministic algorithms (Affinity Propagation, DBSCAN, Hierarchical Agglomerative Clustering) and the deterministic portion of a fourth (K-means), as implemented in various toolkits; in total, we examined 13 algorithm-toolkit combinations. We found that different implementations of deterministic clustering algorithms make different choices, e.g., default parameter settings, noise insertion, input dataset characteristics. As a result, clustering solutions for a fixed algorithm-dataset combination can differ across runs (nondeterminism) and across toolkits (inconsistency). We expose several root causes of such behavior. 
We show that remedying these root causes improves determinism, increases consistency, and can even improve efficiency. Our approach and findings can benefit developers, testers, and users of clustering algorithms.\",\"PeriodicalId\":235967,\"journal\":{\"name\":\"2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST)\",\"volume\":\"4 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2020-10-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/icst46399.2020.00032\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 13th International Conference on Software Testing, Validation and Verification (ICST)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/icst46399.2020.00032","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Implementation-induced Inconsistency and Nondeterminism in Deterministic Clustering Algorithms
A deterministic clustering algorithm is designed to always produce the same clustering solution on a given input. Users of clustering implementations (toolkits) therefore naturally assume that implementations of a deterministic clustering algorithm A behave deterministically, that is: (1) two different implementations I1 and I2 of A are interchangeable, producing the same clustering on a given input D, and (2) an implementation produces the same clustering solution when run repeatedly on D. We challenge these assumptions. Specifically, we analyze clustering behavior on 528 datasets, three deterministic algorithms (Affinity Propagation, DBSCAN, Hierarchical Agglomerative Clustering), and the deterministic portion of a fourth (K-means), as implemented in various toolkits; in total, we examined 13 algorithm-toolkit combinations. We found that different implementations of deterministic clustering algorithms make different implementation choices regarding, e.g., default parameter settings, noise insertion, and input dataset characteristics. As a result, clustering solutions for a fixed algorithm-dataset combination can differ across runs (nondeterminism) and across toolkits (inconsistency). We expose several root causes of such behavior, and show that remedying these root causes improves determinism, increases consistency, and can even improve efficiency. Our approach and findings can benefit developers, testers, and users of clustering algorithms.
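To make the abstract's point concrete, here is a minimal, hypothetical sketch (not taken from the paper) of how two implementations of the same deterministic algorithm can produce different clusterings. It implements single-linkage hierarchical agglomerative clustering twice, differing only in how equally close cluster pairs are tie-broken; the `tie_break` parameter and the toy dataset are illustrative assumptions, standing in for the default-parameter and ordering choices that real toolkits make differently.

```python
from itertools import combinations

def agglomerative(points, k, tie_break):
    """Single-linkage agglomerative clustering of 1-D points into k clusters.

    `tie_break` picks which of several equally close cluster pairs is merged
    first -- an implementation choice, not part of the algorithm's definition.
    """
    clusters = [{i} for i in range(len(points))]

    def dist(a, b):
        # Single linkage: distance between closest members of two clusters.
        return min(abs(points[i] - points[j]) for i in a for j in b)

    while len(clusters) > k:
        pairs = list(combinations(range(len(clusters)), 2))
        dmin = min(dist(clusters[a], clusters[b]) for a, b in pairs)
        # All pairs at the minimum distance are equally valid merges.
        candidates = [(a, b) for a, b in pairs
                      if dist(clusters[a], clusters[b]) == dmin]
        a, b = tie_break(candidates)
        clusters[a] |= clusters[b]
        del clusters[b]
    return sorted(tuple(sorted(c)) for c in clusters)

# Three equally spaced points: the pairs (0,1) and (1,2) are tied at distance 1,
# so two "toolkits" that break the tie differently disagree on the result.
points = [0.0, 1.0, 2.0]
toolkit_a = agglomerative(points, 2, tie_break=lambda cs: cs[0])
toolkit_b = agglomerative(points, 2, tie_break=lambda cs: cs[-1])
print(toolkit_a)  # [(0, 1), (2,)]
print(toolkit_b)  # [(0,), (1, 2)]
```

Both runs execute the same deterministic algorithm on the same input, yet yield different partitions, which mirrors the inconsistency across toolkits that the paper reports.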