HTLL: Latency-Aware Scalable Blocking Mutex

IF 6 2区计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS

IEEE Transactions on Parallel and Distributed Systems Pub Date : 2025-01-07 DOI:10.1109/TPDS.2025.3526859

Ziqu Yu;Jinyu Gu;Zijian Wu;Nian Liu;Jian Guo

{"title":"HTLL: Latency-Aware Scalable Blocking Mutex","authors":"Ziqu Yu;Jinyu Gu;Zijian Wu;Nian Liu;Jian Guo","doi":"10.1109/TPDS.2025.3526859","DOIUrl":null,"url":null,"abstract":"This article finds that existing mutex locks suffer from throughput collapses or latency collapses, or both, in the oversubscribed scenarios where applications create more threads than the CPU core number, e.g., database applications like mysql use per thread per connection. We make an in-depth performance analysis on existing locks and then identify three design rules for the lock primitive to achieve scalable performance in oversubscribed scenarios. First, to achieve ideal throughput, the lock design should keep adequate number of active competitors. Second, the active competitors should be arranged carefully to avoid the lock-holder preemption problem. Third, to meet latency requirements, the lock design should track the latency of each competitor and reorder the competitors according to the latency requirement. We propose a new lock library called HTLL that satisfies these rules and achieves both high throughput and low latency even when the cores are oversubscribed. HTLL only requires minimal human effort (e.g., add several lines of code) to annotate the latency requirement. Evaluation results show that HTLL achieves scalable performance in the oversubscribed scenarios. Specifically, for the real-world database, LMDB, HTLL can reduce the tail latency by up to 97% with only an average 5% degradation in throughput, compared with state-of-the-art alternatives such as Malthusian, CST, and Mutexee locks; In comparison to the widely used pthread_mutex_lock, it can increase the throughput by up to 22% and decrease the latency by up to 80%. Meanwhile, for the under-subscribed scenarios, it also shows comparable performance than state-of-the-art blocking locks.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 3","pages":"471-486"},"PeriodicalIF":6.0000,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Parallel and Distributed Systems","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10830557/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}

引用次数: 0

Abstract

This article finds that existing mutex locks suffer from throughput collapses or latency collapses, or both, in the oversubscribed scenarios where applications create more threads than the CPU core number, e.g., database applications like mysql use per thread per connection. We make an in-depth performance analysis on existing locks and then identify three design rules for the lock primitive to achieve scalable performance in oversubscribed scenarios. First, to achieve ideal throughput, the lock design should keep adequate number of active competitors. Second, the active competitors should be arranged carefully to avoid the lock-holder preemption problem. Third, to meet latency requirements, the lock design should track the latency of each competitor and reorder the competitors according to the latency requirement. We propose a new lock library called HTLL that satisfies these rules and achieves both high throughput and low latency even when the cores are oversubscribed. HTLL only requires minimal human effort (e.g., add several lines of code) to annotate the latency requirement. Evaluation results show that HTLL achieves scalable performance in the oversubscribed scenarios. Specifically, for the real-world database, LMDB, HTLL can reduce the tail latency by up to 97% with only an average 5% degradation in throughput, compared with state-of-the-art alternatives such as Malthusian, CST, and Mutexee locks; In comparison to the widely used pthread_mutex_lock, it can increase the throughput by up to 22% and decrease the latency by up to 80%. Meanwhile, for the under-subscribed scenarios, it also shows comparable performance than state-of-the-art blocking locks.

查看原文本刊更多论文

延迟感知的可伸缩阻塞互斥

本文发现，在应用程序创建的线程数量超过CPU核心数量的情况下（例如，mysql等数据库应用程序每个线程每个连接使用一个线程），现有互斥锁会遭受吞吐量崩溃或延迟崩溃，或者两者兼而有之。我们对现有锁进行了深入的性能分析，然后确定了锁原语的三个设计规则，以便在过度订阅的场景中实现可扩展的性能。首先，为了达到理想的吞吐量，锁的设计应该保持足够数量的活跃竞争对手。其次，应谨慎安排活跃的竞争对手，避免出现持股人抢占的问题。第三，为了满足延迟要求，锁设计应该跟踪每个竞争对手的延迟，并根据延迟要求对竞争对手进行重新排序。我们提出了一个新的锁库，称为html，它满足这些规则，即使在核心被超额订阅时也能实现高吞吐量和低延迟。html只需要最少的人力（例如，添加几行代码）来标注延迟需求。评估结果表明，在超订阅场景下，html实现了可扩展的性能。具体来说，对于现实世界的数据库LMDB，与马尔萨斯锁、CST锁和Mutexee锁等最先进的替代方案相比，html可以将尾部延迟减少97%，而吞吐量平均只降低5%；与广泛使用的pthread_mutex_lock相比，它最多可以将吞吐量提高22%，并将延迟减少80%。同时，对于订阅不足的场景，它也显示出与最先进的阻塞锁相当的性能。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

IEEE Transactions on Parallel and Distributed Systems 工程技术-工程：电子与电气

CiteScore

11.00

自引率

9.40%

发文量

281

审稿时长

5.6 months

期刊介绍： IEEE Transactions on Parallel and Distributed Systems (TPDS) is published monthly. It publishes a range of papers, comments on previously published papers, and survey articles that deal with the parallel and distributed systems research areas of current importance to our readers. Particular areas of interest include, but are not limited to: a) Parallel and distributed algorithms, focusing on topics such as: models of computation; numerical, combinatorial, and data-intensive parallel algorithms, scalability of algorithms and data structures for parallel and distributed systems, communication and synchronization protocols, network algorithms, scheduling, and load balancing. b) Applications of parallel and distributed computing, including computational and data-enabled science and engineering, big data applications, parallel crowd sourcing, large-scale social network analysis, management of big data, cloud and grid computing, scientific and biomedical applications, mobile computing, and cyber-physical systems. c) Parallel and distributed architectures, including architectures for instruction-level and thread-level parallelism; design, analysis, implementation, fault resilience and performance measurements of multiple-processor systems; multicore processors, heterogeneous many-core systems; petascale and exascale systems designs; novel big data architectures; special purpose architectures, including graphics processors, signal processors, network processors, media accelerators, and other special purpose processors and accelerators; impact of technology on architecture; network and interconnect architectures; parallel I/O and storage systems; architecture of the memory hierarchy; power-efficient and green computing architectures; dependable architectures; and performance modeling and evaluation. d) Parallel and distributed software, including parallel and multicore programming languages and compilers, runtime systems, operating systems, Internet computing and web services, resource management including green computing, middleware for grids, clouds, and data centers, libraries, performance modeling and evaluation, parallel programming paradigms, and programming environments and tools.