Multi-threaded UPC runtime with network endpoints: Design alternatives and evaluation on multi-core architectures

2011 18th International Conference on High Performance Computing | Pub Date: 2011-12-18 | DOI: 10.1109/HiPC.2011.6152734

Miao Luo, Jithin Jose, S. Sur, D. Panda


Citations: 15

Abstract

Multi-core architectures are becoming increasingly popular in the High End Computing (HEC) era. Recent trends in high-productivity computing, together with advanced multi-core and network architectures, have increased interest in Partitioned Global Address Space (PGAS) languages because of their high productivity and broad applicability. Unified Parallel C (UPC) is an emerging PGAS language. In this paper, we compare design alternatives for a high-performance, scalable UPC runtime on multi-core nodes from several aspects: performance, portability, interoperability, and support for irregular parallelism. Based on our analysis, we present a novel design of a multi-threaded UPC runtime that supports multiple network endpoints. Compared to the multi-threaded Berkeley UPC Runtime, our runtime dramatically decreases network access contention, resulting in 80% lower latency for fine-grained memget/memput operations and almost double the bandwidth for medium-size messages. Furthermore, the multi-endpoint design opens the door to new runtime optimizations, such as support for irregular parallelism. We utilize true network helper threads and load balancing via work stealing in the runtime. Our evaluation with novel benchmarks shows that our runtime can achieve 90% of peak efficiency, which is 1.3 times better than the existing Berkeley UPC Runtime. To the best of our knowledge, this is the first work to propose a UPC runtime design with multiple network endpoints for modern multi-core systems.
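The abstract names load balancing via work stealing but does not describe the mechanism. As a rough illustration only, the sketch below simulates the general technique in Python: each worker owns a queue of tasks, and an idle worker takes a task from the tail of the most loaded queue. The function name `run_work_stealing`, the lock-step time model, and the victim-selection rule are all assumptions made for this example; none of them comes from the paper or from the Berkeley UPC Runtime.

```python
from collections import deque


def run_work_stealing(task_costs_per_worker, steal=True):
    """Simulate workers in lock-step; return (time steps, number of steals).

    Hypothetical model, not the paper's runtime: each worker owns a deque
    of task costs (in time steps). With steal=True, an idle worker takes
    one task from the tail of the currently longest queue.
    """
    queues = [deque(costs) for costs in task_costs_per_worker]
    remaining = [0] * len(queues)  # steps left on each worker's current task
    steps = 0
    steals = 0
    while any(queues) or any(remaining):
        for w, q in enumerate(queues):
            if remaining[w] != 0:
                continue  # worker is still busy with its current task
            if q:
                remaining[w] = q.popleft()  # owner takes from the head
            elif steal:
                victim = max(range(len(queues)), key=lambda v: len(queues[v]))
                if queues[victim]:
                    remaining[w] = queues[victim].pop()  # thief takes the tail
                    steals += 1
        # Advance simulated time by one step for every busy worker.
        for w in range(len(remaining)):
            if remaining[w] > 0:
                remaining[w] -= 1
        steps += 1
    return steps, steals


# Irregular load: worker 0 holds all eight unit-cost tasks.
load = [[1] * 8, [], [], []]
print(run_work_stealing(load, steal=False))  # (8, 0)
print(run_work_stealing(load, steal=True))   # (2, 6)
```

In this toy model the imbalanced load finishes in 2 steps with stealing instead of 8 without it. The figures illustrate only the idea of redistributing irregular work and say nothing about the measured results reported in the paper.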



