SuperMalloc: a super fast multithreaded malloc for 64-bit machines

Bradley C. Kuszmaul
Proceedings of the 2015 International Symposium on Memory Management. Published 2015-06-14.
DOI: 10.1145/2754169.2754178 (https://doi.org/10.1145/2754169.2754178)
Citations: 42

Abstract

SuperMalloc is an implementation of malloc(3) originally designed for x86 Hardware Transactional Memory (HTM). It turns out that the same design decisions also make it fast even without HTM. For the malloc-test benchmark, which is one of the most difficult workloads for an allocator, with one thread SuperMalloc is about 2.1 times faster than the best of DLmalloc, JEmalloc, Hoard, and TBBmalloc; with 8 threads and HTM, SuperMalloc is 2.75 times faster; and on 32 threads without HTM, SuperMalloc is 3.4 times faster. SuperMalloc generally compares favorably with the other allocators on speed, scalability, speed variance, memory footprint, and code size. SuperMalloc achieves these performance advantages using less than half as much code as the alternatives. SuperMalloc exploits the fact that although physical memory is always precious, virtual address space on a 64-bit machine is relatively cheap. It allocates 2 MiB chunks, each of which contains objects of a single size. To translate chunk numbers to chunk metadata, SuperMalloc uses a simple array (most of which is uncommitted to physical memory). SuperMalloc takes care to avoid associativity conflicts in the cache: most of the size classes are a prime number of cache lines, and nonaligned huge accesses are randomly aligned within a page. Objects are allocated from the fullest non-full page in the appropriate size class. For each size class, SuperMalloc employs a 10-object per-thread cache, a per-CPU cache that holds about a level-2-cache worth of objects per size class, and a global cache that is organized to allow the movement of many objects between a per-CPU cache and the global cache using $O(1)$ instructions. SuperMalloc prefetches everything it can before starting a critical section, which makes the critical sections run fast, and for HTM improves the odds that the transaction will commit.
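The chunk-number-to-metadata array mentioned in the abstract can be illustrated with a minimal C sketch. This is not SuperMalloc's actual code. The 2 MiB chunk size, the 48-bit address width, the `chunk_info` layout, and every function name below are assumptions chosen for illustration. The sketch reserves one table entry per possible chunk with `mmap`, so the table occupies virtual address space, and the kernel commits physical pages only for the parts that are touched.

```c
#define _DEFAULT_SOURCE            /* exposes MAP_ANONYMOUS under strict -std= modes */
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

/* Assumptions for illustration only (not taken from SuperMalloc's source):
   2 MiB chunks and at most 48 bits of user virtual address. */
#define CHUNK_SHIFT  21                                   /* log2(2 MiB) */
#define ADDRESS_BITS 48
#define N_CHUNKS     ((size_t)1 << (ADDRESS_BITS - CHUNK_SHIFT))

/* Hypothetical per-chunk metadata: the size class of the objects in the chunk. */
typedef struct {
    uint8_t size_class;
} chunk_info;

static chunk_info *chunk_table;

/* Reserve address space for one entry per possible chunk.
   Anonymous mappings are zero-filled and backed by physical memory
   only once a page is first written. Returns 0 on success, -1 on failure. */
int chunk_table_init(void) {
    void *p = mmap(NULL, N_CHUNKS * sizeof(chunk_info),
                   PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_NORESERVE, -1, 0);
    if (p == MAP_FAILED) return -1;
    chunk_table = p;
    return 0;
}

/* The chunk number is the address with the in-chunk offset shifted away. */
size_t chunk_number(const void *ptr) {
    return (size_t)((uintptr_t)ptr >> CHUNK_SHIFT);
}

/* Metadata lookup is a single array index: no hashing, no tree walk. */
chunk_info *chunk_lookup(const void *ptr) {
    return &chunk_table[chunk_number(ptr)];
}
```

With these assumed parameters the table spans 2^27 one-byte entries (128 MiB of virtual address space), yet a program that touches only a few chunks commits only a few pages of it. A call such as `chunk_lookup(p)->size_class` then recovers an object's size class from its address alone.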