Analyzing and improving performance scalability of commercial server workloads on a chip multiprocessor

2009 IEEE International Symposium on Workload Characterization (IISWC), Pub Date: 2009-10-04, DOI: 10.1109/IISWC.2009.5306781

K. Ishizaki, T. Nakatani, S. Daijavad


Citations: 8

Abstract

A chip multiprocessor (CMP) with many low-performance cores can achieve high performance or a high performance-to-power ratio for commercial server applications. The large number of hardware threads on such a CMP poses significant challenges for application developers writing scalable applications. Many papers have assessed the architectural characteristics and the performance scalability, and some have identified lock contention as one of the scalability bottlenecks. However, few studies have resolved these problems, analyzed their causes, and compared the architectural characteristics before and after the scalability limitations were addressed. We analyzed and resolved some of the problems limiting the scalability of three commercial server applications with 64 hardware threads. We also made before-and-after comparisons of the architectural characteristics affected by the scalability enhancements, to support the development of new processors. We addressed the lock contention with changes to the Java code. Our enhancements improved performance scalability by up to 132%. We show that although the causes of lock contention lie in different software layers, they share certain similarities and can be organized into three categories. Our comparisons reveal that the CPI and data TLB miss rates decrease, while the L2 data cache miss rates, L2 instruction cache miss rates, and memory traffic increase. These results suggest that the performance scalability problems of an application must be addressed before the architectural characteristics of a CMP can be measured accurately.

