Evaluating the Cost of Atomic Operations on Modern Architectures

Hermann Schweizer, Maciej Besta, T. Hoefler
{"title":"Evaluating the Cost of Atomic Operations on Modern Architectures","authors":"Hermann Schweizer, Maciej Besta, T. Hoefler","doi":"10.1109/PACT.2015.24","DOIUrl":null,"url":null,"abstract":"Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet, performance tradeoffs between these operations and various characteristics of such systems, such as the structure of caches, are unclear and have not been thoroughly analyzed. In this paper we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for latency and bandwidth of different atomics. We consider various state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth even if they are characterized by different consensus numbers. Another insight is that the design of atomics prevents any instruction-level parallelism even if there are no dependencies between the issued operations. Finally, we discuss solutions to the discovered performance issues in the analyzed architectures. Our analysis can be used for making better design and algorithmic decisions in parallel programming on various architectures deployed in both off-the-shelf machines and large compute systems.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"87","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2015.24","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 87

Abstract

Atomic operations (atomics) such as Compare-and-Swap (CAS) or Fetch-and-Add (FAA) are ubiquitous in parallel programming. Yet, performance tradeoffs between these operations and various characteristics of such systems, such as the structure of caches, are unclear and have not been thoroughly analyzed. In this paper we establish an evaluation methodology, develop a performance model, and present a set of detailed benchmarks for latency and bandwidth of different atomics. We consider various state-of-the-art x86 architectures: Intel Haswell, Xeon Phi, Ivy Bridge, and AMD Bulldozer. The results unveil surprising performance relationships between the considered atomics and architectural properties such as the coherence state of the accessed cache lines. One key finding is that all the tested atomics have comparable latency and bandwidth even if they are characterized by different consensus numbers. Another insight is that the design of atomics prevents any instruction-level parallelism even if there are no dependencies between the issued operations. Finally, we discuss solutions to the discovered performance issues in the analyzed architectures. Our analysis can be used for making better design and algorithmic decisions in parallel programming on various architectures deployed in both off-the-shelf machines and large compute systems.
评估现代体系结构中原子操作的成本
原子操作(原子),如比较与交换(CAS)或获取与添加(FAA),在并行编程中无处不在。然而,这些操作和这些系统的各种特征(比如缓存的结构)之间的性能权衡并不清楚,也没有得到彻底的分析。在本文中,我们建立了一种评估方法,开发了一个性能模型,并为不同原子的延迟和带宽提供了一组详细的基准测试。我们考虑了各种最先进的x86架构:英特尔Haswell, Xeon Phi, Ivy Bridge和AMD推土机。结果揭示了所考虑的原子和架构属性(如访问的缓存线的相干状态)之间令人惊讶的性能关系。一个关键的发现是,所有被测试的原子都具有相似的延迟和带宽,即使它们具有不同的共识数。另一个观点是,原子的设计阻止了任何指令级并行性,即使发出的操作之间没有依赖关系。最后,我们讨论了在所分析的体系结构中发现的性能问题的解决方案。我们的分析可用于在部署在现成机器和大型计算系统中的各种架构上的并行编程中做出更好的设计和算法决策。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信