On Efficiently Partitioning a Topic in Apache Kafka

2022 International Conference on Computer, Information and Telecommunication Systems (CITS). Publication date: 2022-05-19. DOI: 10.1109/CITS55221.2022.9832981

Theofanis P. Raptis, A. Passarella


Citations: 5

Abstract

Apache Kafka addresses the general problem of delivering extremely high-volume event data to diverse consumers via a publish-subscribe messaging system. It uses partitions to scale a topic across many brokers, so that producers can write data in parallel and consumers can read it in parallel. Even though Apache Kafka provides some out-of-the-box optimizations, it does not strictly define how each topic should be efficiently distributed into partitions. The well-formulated fine-tuning needed to improve the performance of an Apache Kafka cluster is still an open research problem. In this paper, we first model the Apache Kafka topic partitioning process for a given topic. Then, given the set of brokers, constraints, and application requirements on throughput, OS load, replication latency, and unavailability, we formulate the optimization problem of finding how many partitions are needed and show that it is computationally intractable, being an integer program. Furthermore, we propose two simple yet efficient heuristics to solve the problem: the first tries to minimize and the second to maximize the number of brokers used in the cluster. Finally, we evaluate their performance via large-scale simulations, using as benchmarks some Apache Kafka cluster configuration recommendations provided by Microsoft and Confluent. We demonstrate that, unlike the recommendations, the proposed heuristics respect the hard constraints on replication latency and perform better with respect to unavailability time and OS load, using the system resources more prudently.
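For context on the kind of recommendation the abstract benchmarks against, a widely cited rule of thumb from Confluent's guidance sizes a topic by throughput alone: with target throughput t, measured per-partition producer throughput p, and per-partition consumer throughput c, use at least max(t/p, t/c) partitions. The sketch below illustrates only that rule. It is not the paper's model or either of its heuristics, and whether this exact rule is among the benchmark configurations in the evaluation is an assumption. The function name and the MB/s units are hypothetical.

```python
import math


def min_partitions(target_mb_s: float, producer_mb_s: float, consumer_mb_s: float) -> int:
    """Throughput-only rule of thumb: max(t/p, t/c), rounded up.

    target_mb_s   -- desired topic throughput t (illustrative unit: MB/s)
    producer_mb_s -- measured producer throughput per partition, p
    consumer_mb_s -- measured consumer throughput per partition, c

    Note: this ignores OS load, replication latency, and unavailability,
    which are the constraints the abstract says the paper accounts for.
    """
    if min(target_mb_s, producer_mb_s, consumer_mb_s) <= 0:
        raise ValueError("throughputs must be positive")
    producer_side = math.ceil(target_mb_s / producer_mb_s)
    consumer_side = math.ceil(target_mb_s / consumer_mb_s)
    return max(producer_side, consumer_side)


if __name__ == "__main__":
    # Hypothetical numbers: 100 MB/s target, 10 MB/s per partition when
    # producing, 20 MB/s per partition when consuming.
    print(min_partitions(100.0, 10.0, 20.0))
```

Because the rule looks only at throughput, it can return a partition count that violates a replication-latency bound. The abstract reports that the proposed heuristics respect that bound, unlike the benchmark recommendations.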

