Greedy Centroid Initialization for Federated K-means

2023 57th Annual Conference on Information Sciences and Systems (CISS), Pub Date: 2023-03-22, DOI: 10.1109/CISS56502.2023.10089666

Kun Yang, M. Amiri, Sanjeev R. Kulkarni

Citations: 1

Abstract

K-means is a widely used data clustering algorithm that partitions a set of data points into $K$ clusters by finding the $K$ centroids that best represent the data. Initialization plays a vital role in the traditional centralized K-means algorithm, where clustering is carried out at a central node with access to all of the data points. In this paper, we focus on K-means in a federated setting, where the clients store data locally and the raw data never leaves the devices. Given the importance of initialization for the federated K-means algorithm, we aim to find better initial centroids by leveraging the local data on each client. To this end, we start the centroid initialization at the clients rather than at the server, which initially has no information about the clients' data. The clients first select their local initial clusters and share their clustering information (cluster centroids and sizes) with the server. The server then uses a greedy algorithm to choose the global initial centroids based on the information received from the clients. Numerical results on synthetic and public datasets show that our proposed method can achieve better and more stable performance than three federated K-means variants, and performance similar to that of the centralized K-means algorithm.
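The client-then-server flow described in the abstract can be illustrated with a minimal Python sketch. The abstract does not specify the greedy rule, so the selection criterion below (start from the largest reported cluster, then repeatedly pick the candidate that maximizes cluster size times squared distance to the nearest centroid already chosen) is an assumption made for illustration and may differ from the paper's method. The function names `local_kmeans` and `greedy_server_init` are hypothetical.

```python
import numpy as np


def local_kmeans(X, k, n_iter=10, seed=0):
    """Run plain Lloyd's K-means on one client's data.

    Returns the local centroids and cluster sizes, which are the only
    quantities a client shares with the server in this sketch.
    """
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random (requires len(X) >= k).
    centroids = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Squared distance from every point to every centroid.
        d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        for j in range(k):
            members = X[labels == j]
            if len(members) > 0:  # keep the old centroid if a cluster empties
                centroids[j] = members.mean(axis=0)
    # Final assignment, so the reported sizes match the reported centroids.
    d = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    sizes = np.bincount(d.argmin(axis=1), minlength=k)
    return centroids, sizes


def greedy_server_init(centroids, sizes, k):
    """Pick k global initial centroids from the pooled client reports.

    ASSUMED greedy rule (not taken from the paper): begin with the largest
    cluster, then repeatedly add the candidate with the highest
    size * squared-distance-to-nearest-chosen-centroid score.
    """
    centroids = np.asarray(centroids, dtype=float)
    sizes = np.asarray(sizes, dtype=float)
    chosen = [int(sizes.argmax())]
    min_d = ((centroids - centroids[chosen[0]]) ** 2).sum(axis=1)
    while len(chosen) < k:
        nxt = int((sizes * min_d).argmax())
        chosen.append(nxt)
        d_new = ((centroids - centroids[nxt]) ** 2).sum(axis=1)
        min_d = np.minimum(min_d, d_new)
    return centroids[chosen]


if __name__ == "__main__":
    # Synthetic example: four clients, each holding points from three blobs.
    rng = np.random.default_rng(0)
    true_centers = np.array([[0.0, 0.0], [8.0, 8.0], [-8.0, 8.0]])
    clients = [
        true_centers[rng.integers(0, 3, size=200)] + rng.normal(size=(200, 2))
        for _ in range(4)
    ]
    reports = [local_kmeans(X, k=3, seed=i) for i, X in enumerate(clients)]
    all_centroids = np.vstack([c for c, _ in reports])
    all_sizes = np.concatenate([s for _, s in reports])
    print(greedy_server_init(all_centroids, all_sizes, k=3))
```

Only centroids and cluster sizes cross the client boundary here, which matches the abstract's statement that raw data never leaves the devices. Weighting by size is one plausible way to use the reported sizes; other greedy criteria would fit the same interface.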
