Performance of multi-process and multi-thread processing on multi-core SMT processors

IEEE International Symposium on Workload Characterization (IISWC'10)
Pub Date: 2010-12-02
DOI: 10.1109/IISWC.2010.5650174

H. Inoue, T. Nakatani


Citations: 18

Abstract

Many modern high-performance processors support multiple hardware threads in the form of multiple cores and SMT (Simultaneous Multi-Threading). Hence, achieving good performance scalability of programs with respect to the number of cores (core scalability) and the number of SMT threads in one core (SMT scalability) is critical. To identify a way to achieve higher performance on multi-core SMT processors, this paper compares the performance scalability of two parallelization models (using multiple processes and using multiple threads in one process) on two types of hardware parallelism (core scalability and SMT scalability). We tested standard Java benchmarks and a real-world server program written in PHP on two platforms: Sun's UltraSPARC T1 (Niagara) processor and Intel's Xeon (Nehalem) processor. We show that the multi-thread model achieves better SMT scalability than the multi-process model by reducing the number of cache misses and DTLB misses. However, both models achieve roughly equal core scalability. We show that the multi-thread model generates up to 7.4 times more DTLB misses than the multi-process model when multiple cores are used. To take advantage of both models, we implemented a memory allocator for a PHP runtime to reduce DTLB misses on multi-core SMT processors. The allocator is aware of the core that is running each software thread and allocates memory blocks from the same memory page for each processor core. When using all of the hardware threads on a Niagara, the core-aware allocator reduces DTLB misses by 46.7% compared to the default allocator and improves performance by 3.0%.
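
The core-aware allocation idea described in the abstract can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the authors' PHP runtime allocator: the class names, the 8 KB `PAGE_SIZE`, the bump-pointer strategy, and the use of "distinct pages touched per core" as a rough proxy for DTLB pressure are all hypothetical choices made for illustration.

```python
from collections import defaultdict

PAGE_SIZE = 8192  # bytes; assumed page size, chosen only for illustration


class SharedBumpAllocator:
    """Hands out blocks sequentially from one pool shared by all cores."""

    def __init__(self):
        self.next_addr = 0

    def alloc(self, core_id, size):
        # core_id is ignored: blocks from different cores end up interleaved.
        addr = self.next_addr
        self.next_addr += size
        return addr


class CoreAwareBumpAllocator:
    """Gives each core its own pages, so one core's blocks share pages."""

    def __init__(self):
        self.next_page = 0  # next unused page number
        self.cursor = {}    # core_id -> next free address in its current page
        self.limit = {}     # core_id -> end address of its current page

    def alloc(self, core_id, size):
        if size > PAGE_SIZE:
            raise ValueError("this sketch only handles blocks up to one page")
        needs_page = (core_id not in self.cursor
                      or self.cursor[core_id] + size > self.limit[core_id])
        if needs_page:
            # Start a fresh page that belongs to this core only.
            base = self.next_page * PAGE_SIZE
            self.next_page += 1
            self.cursor[core_id] = base
            self.limit[core_id] = base + PAGE_SIZE
        addr = self.cursor[core_id]
        self.cursor[core_id] += size
        return addr


def pages_touched_per_core(allocator, n_cores, allocs_per_core, size):
    """Interleave requests from all cores and count distinct pages per core."""
    pages = defaultdict(set)
    for _ in range(allocs_per_core):
        for core in range(n_cores):
            addr = allocator.alloc(core, size)
            pages[core].add(addr // PAGE_SIZE)
            pages[core].add((addr + size - 1) // PAGE_SIZE)
    return {core: len(touched) for core, touched in pages.items()}


if __name__ == "__main__":
    print(pages_touched_per_core(SharedBumpAllocator(), 4, 128, 64))
    print(pages_touched_per_core(CoreAwareBumpAllocator(), 4, 128, 64))
```

With 4 simulated cores each requesting 128 blocks of 64 bytes, the shared pool spreads every core's blocks over 4 pages, while the core-aware pool keeps each core's blocks on 1 page. The simulation says nothing about actual miss rates; it only shows why grouping a core's blocks on the same pages can reduce the number of translations that core needs.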

