Dynamic cache contention detection in multi-threaded applications

International Conference on Virtual Execution Environments. Pub Date: 2011-03-09. DOI: 10.1145/1952682.1952688

Qin Zhao, David Koh, Syed Raza, Derek Bruening, W. Wong, Saman P. Amarasinghe


Citations: 76

Abstract

In today's multi-core systems, cache contention due to true and false sharing can cause unexpected and significant performance degradation. A detailed understanding of a given multi-threaded application's behavior is required to precisely identify such performance bottlenecks. Traditionally, however, such diagnostic information can only be obtained after lengthy simulation of the memory hierarchy. In this paper, we present a novel approach that efficiently analyzes interactions between threads to determine thread correlation and detect true and false sharing. It is based on the following key insight: although the slowdown caused by cache contention depends on factors including the thread-to-core binding and parameters of the memory hierarchy, the amount of data sharing is primarily a function of the cache line size and application behavior. Using memory shadowing and dynamic instrumentation, we implemented a tool that obtains detailed sharing information between threads without simulating the full complexity of the memory hierarchy. The runtime overhead of our approach (a 5x slowdown on average relative to native execution) is significantly less than that of detailed cache simulation. The information collected allows programmers to identify the degree of cache contention in an application, the correlation among its threads, and the sources of significant false sharing. Using our approach, we were able to improve the performance of some applications by up to a factor of 12. For other contention-intensive applications, we were able to shed light on the obstacles that prevent their performance from scaling to many cores.
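
The key insight above, that sharing depends mainly on cache line size and on which addresses each thread touches, can be illustrated with a small sketch. This is not the authors' tool: it is a simplified Python model under stated assumptions (a 64-byte line, per-byte shadow state, no notion of time or thread-to-core binding), and the names `SharingTracker`, `record`, and `classify` are hypothetical. It labels a line as true sharing when some byte is written by one thread and accessed by another, and as false sharing when threads contend on the line only through different bytes.

```python
from collections import defaultdict

CACHE_LINE_SIZE = 64  # bytes; an assumed, typical value


class SharingTracker:
    """Toy shadow state: for each cache line, which threads read/wrote each byte."""

    def __init__(self, line_size=CACHE_LINE_SIZE):
        self.line_size = line_size
        # line index -> byte offset within line -> set of thread ids
        self.readers = defaultdict(lambda: defaultdict(set))
        self.writers = defaultdict(lambda: defaultdict(set))

    def record(self, thread_id, addr, size, is_write):
        """Record one memory access of `size` bytes starting at `addr`."""
        table = self.writers if is_write else self.readers
        for a in range(addr, addr + size):
            line, offset = divmod(a, self.line_size)
            table[line][offset].add(thread_id)

    def classify(self, line):
        """Classify the sharing pattern observed on one cache line."""
        readers = self.readers.get(line, {})
        writers = self.writers.get(line, {})
        all_threads = set()
        for per_byte in (readers, writers):
            for tids in per_byte.values():
                all_threads |= tids
        if not all_threads:
            return "untouched"
        if len(all_threads) == 1:
            return "private"
        if not writers:
            return "read-only shared"  # no writer, so no invalidations
        for offset, w in writers.items():
            # A written byte that another thread also touches is true sharing.
            if len(w | readers.get(offset, set())) > 1:
                return "true sharing"
        # Several threads, at least one writer, but no byte in common.
        return "false sharing"


tracker = SharingTracker()
tracker.record(0, 0, 4, is_write=True)    # thread 0 writes bytes 0..3
tracker.record(1, 4, 4, is_write=True)    # thread 1 writes bytes 4..7, same line
tracker.record(0, 64, 4, is_write=True)   # thread 0 writes a word in line 1
tracker.record(1, 64, 4, is_write=False)  # thread 1 reads the same word
print(tracker.classify(0), "|", tracker.classify(1))
```

Because the classification uses only addresses, thread identifiers, and the line size, it needs no model of the cache hierarchy, which is the property the abstract relies on. A real tool would also have to account for access ordering and frequency, which this sketch ignores.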



