Entropy-based Consensus for Distributed Data Clustering

Journal of Artificial Intelligence and Data Mining · Pub Date: 2019-11-01 · DOI: 10.22044/JADM.2018.4237.1514

M. Owhadi-Kareshki, M. Akbarzadeh-T.

Volume 7, pages 551-561 · Journal Article

Citations: 2

Abstract

The growing scale of available data and increasingly strict privacy requirements are among the main challenges of data mining today. This paper introduces Entropy-based Consensus on Cluster Centers (EC3) for clustering in distributed systems where data confidentiality matters: the consensus process relies only on negotiation among local cluster centers, so no private data are transferred. Entropy serves as an internal validity measure for consensus clustering at each machine, so that cluster centers from local machines with higher expected clustering validity have more influence on the final consensus centers. The relative cost function of the local Fuzzy C-Means (FCM) and the number of data points on each machine are also used, as measures of a machine's validity relative to other machines and of its reliability, respectively. The proposed consensus strategy is evaluated on 18 datasets from the UCI repository in terms of clustering accuracy and speed-up against the centralized version of FCM. Several experiments confirm that the proposed approach yields higher speed-up and accuracy while maintaining data security, owing to its protected and distributed processing.
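The abstract does not give EC3's formulas, so the following is only a rough sketch of the general idea it describes: each machine reports its local cluster centers together with a few summary numbers, and the centers are combined with weights that favor machines whose fuzzy partition has lower entropy, lower FCM cost, and more data points. The function names, the normalized partition entropy as the validity measure, the specific way the three factors are multiplied, and the assumption that cluster indices already correspond across machines are all illustrative choices, not the authors' method.

```python
import math


def partition_entropy(memberships):
    """Normalized partition entropy of a fuzzy membership matrix.

    memberships[k][i] is the degree to which point k belongs to cluster i;
    each row sums to 1 and there are at least two clusters.
    Returns a value in [0, 1]; 0 corresponds to a crisp partition.
    """
    n = len(memberships)
    c = len(memberships[0])
    total = 0.0
    for row in memberships:
        for u in row:
            if u > 0.0:  # treat 0 * log(0) as 0
                total -= u * math.log(u)
    return total / (n * math.log(c))


def machine_weights(entropies, costs, counts):
    """Hypothetical per-machine weights (an assumed form, not taken from the paper).

    Lower entropy, lower FCM cost per point, and more data points all
    increase a machine's weight. Weights are normalized to sum to 1.
    """
    raw = []
    for h, j, n in zip(entropies, costs, counts):
        validity = 1.0 - h                   # entropy-based validity (assumed)
        relative_cost = 1.0 / (1.0 + j / n)  # smaller cost per point is better (assumed)
        raw.append(validity * relative_cost * n)
    s = sum(raw)
    if s == 0.0:  # every machine looks equally uninformative: fall back to uniform
        return [1.0 / len(raw)] * len(raw)
    return [r / s for r in raw]


def consensus_centers(local_centers, weights):
    """Weighted average of local centers.

    local_centers[p][i] is center i on machine p. Cluster indices are assumed
    to be aligned across machines already; only centers are exchanged, never
    the raw data points.
    """
    c = len(local_centers[0])
    d = len(local_centers[0][0])
    out = [[0.0] * d for _ in range(c)]
    for centers, w in zip(local_centers, weights):
        for i in range(c):
            for t in range(d):
                out[i][t] += w * centers[i][t]
    return out


# Toy usage: two machines, two clusters in 2-D, with made-up values.
U_a = [[0.9, 0.1], [0.2, 0.8], [0.95, 0.05]]
U_b = [[0.6, 0.4], [0.5, 0.5]]
weights = machine_weights(
    entropies=[partition_entropy(U_a), partition_entropy(U_b)],
    costs=[1.2, 2.5],  # made-up local FCM objective values
    counts=[len(U_a), len(U_b)],
)
centers = consensus_centers(
    [[[0.0, 0.0], [5.0, 5.0]], [[0.4, -0.2], [4.6, 5.3]]], weights
)
print(weights, centers)
```

In this toy run the first machine has the crisper partition, the lower cost, and more points, so its centers dominate the consensus. A real deployment would also need a step that matches cluster indices between machines before averaging, which this sketch leaves out.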


Source journal

Journal of Artificial Intelligence and Data Mining

Self-citation rate

0.00%

Articles published

Review time

8 weeks