An Improved Approximation Algorithm for the Capacitated Correlation Clustering Problem

Impact factor 0.6; Zone 4 (Computer Science); JCR Q4 (COMPUTER SCIENCE, THEORY & METHODS)

International Journal of Foundations of Computer Science. Pub Date: 2023-04-27. DOI: 10.1142/s0129054123410010

Sai Ji, Yukun Cheng, Jingjing Tan, Zhongrui Zhao


Citations: 0

Abstract

The correlation clustering problem (CorCP) is a classical clustering problem that groups data according to pairwise similarity, with applications in interaction networks, cross-lingual link detection, and communication networks. In this paper, we study a practical generalization of the CorCP, called the capacitated correlation clustering problem (the capacitated CorCP), which is modeled on a labeled complete graph. In this graph, each vertex represents a piece of data. If two pieces of data are similar, the edge between the corresponding vertices is marked with a positive label +; otherwise, it is marked with a negative label −. The objective of the capacitated CorCP is to group similar data into the same cluster as far as possible while satisfying the cluster capacity constraint. Formally, we partition the vertex set of the labeled complete graph into clusters, each of size at most a given upper bound, so as to minimize the number of disagreements. The number of disagreements is the total number of positive edges between clusters plus the number of negative edges within clusters. Unlike the previous algorithm in [18], which handles the cluster-size constraint through a penalty measure, our algorithm for the capacitated CorCP directly outputs a feasible solution by iteratively constructing clusters based on a preset threshold. By carefully setting the threshold and refining the analysis, we prove that our algorithm achieves an improved approximation ratio of 5.37. We also conduct a series of numerical experiments to demonstrate the effectiveness of our algorithm.
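The objective described in the abstract can be made concrete with a small sketch. The Python below is only an illustration of the disagreement count and the capacity check, not the paper's approximation algorithm; the function names and the toy instance are hypothetical, and every vertex pair not listed as positive is assumed to carry a negative label.

```python
from itertools import combinations


def count_disagreements(vertices, positive_edges, clusters):
    """Count '+' edges between clusters plus '-' edges within clusters."""
    positive = {frozenset(edge) for edge in positive_edges}
    cluster_of = {v: i for i, cluster in enumerate(clusters) for v in cluster}
    disagreements = 0
    for u, v in combinations(vertices, 2):
        same_cluster = cluster_of[u] == cluster_of[v]
        is_positive = frozenset((u, v)) in positive
        # A '+' edge disagrees when its endpoints are separated;
        # a '-' edge disagrees when its endpoints share a cluster.
        if is_positive != same_cluster:
            disagreements += 1
    return disagreements


def is_feasible(vertices, clusters, capacity):
    """Check that the clusters partition the vertices and respect the size bound."""
    covered = [v for cluster in clusters for v in cluster]
    is_partition = len(covered) == len(set(covered)) and set(covered) == set(vertices)
    return is_partition and all(len(cluster) <= capacity for cluster in clusters)


# Hypothetical toy instance (assumption): 4 vertices, unlisted pairs are '-' edges.
vertices = [0, 1, 2, 3]
positive_edges = [(0, 1), (0, 2), (1, 2), (2, 3)]
capacity = 2

for clusters in ([[0, 1, 2], [3]], [[0, 1], [2, 3]]):
    print(clusters,
          "feasible:", is_feasible(vertices, clusters, capacity),
          "disagreements:", count_disagreements(vertices, positive_edges, clusters))
```

On this toy instance, the clustering {0, 1, 2}, {3} has a single disagreement but violates the capacity of 2, while {0, 1}, {2, 3} is feasible with two disagreements. This is the kind of trade-off the capacity constraint introduces.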


Source journal

International Journal of Foundations of Computer Science (Engineering and Technology, Computer Science: Theory and Methods)

CiteScore: 1.60
Self-citation rate: 12.50%
Review time: 3 months

About the journal: The International Journal of Foundations of Computer Science is a bimonthly journal that publishes articles which contribute new theoretical results in all areas of the foundations of computer science. The theoretical and mathematical aspects covered include:
- Algebraic theory of computing and formal systems
- Algorithm and system implementation issues
- Approximation, probabilistic, and randomized algorithms
- Automata and formal languages
- Automated deduction
- Combinatorics and graph theory
- Complexity theory
- Computational biology and bioinformatics
- Cryptography
- Database theory
- Data structures
- Design and analysis of algorithms
- DNA computing
- Foundations of computer security
- Foundations of high-performance computing