Evaluating tradeoffs between MPI message matching offload hardware capacity and performance

Proceedings of the 26th European MPI Users' Group Meeting
Pub Date: 2019-09-11
DOI: 10.1145/3343211.3343223

Scott Levy, Kurt B. Ferreira

Citations: 4

Abstract

Although its demise has been frequently predicted, the Message Passing Interface (MPI) remains the dominant programming model for scientific applications running on high-performance computing (HPC) systems. MPI specifies powerful semantics for interprocess communication that have enabled scientists to write applications for simulating important physical phenomena. However, these semantics have also presented several significant challenges. For example, the existence of wildcard values has made the efficient enforcement of MPI message matching semantics challenging. Significant research has been dedicated to accelerating MPI message matching. One common approach has been to offload matching to dedicated hardware. One of the challenges that hardware designers have faced is knowing how to size hardware structures to accommodate outstanding match requests. Applications that exceed the capacity of specialized hardware typically must fall back to storing match requests in bulk memory, e.g., DRAM on the host processor. In this paper, we examine the implications of hardware matching and develop guidance on sizing hardware matching structures to strike a balance between minimizing expensive dedicated hardware resources and overall matching performance. By examining the message matching behavior of several important HPC workloads, we show that when specialized hardware matching is not dramatically faster than matching in memory, the offload hardware's match queue capacity can be reduced without significantly increasing match time. On the other hand, effectively exploiting the benefits of very fast specialized matching hardware requires sufficient storage resources to ensure that every search completes in the specialized hardware. The data and analysis in this paper provide important guidance for designers of MPI message matching hardware.
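The capacity tradeoff described in the abstract can be illustrated with a toy model. The sketch below is not the authors' method or simulator; it is a minimal illustration under stated assumptions: posted receives are searched linearly in posting order, matching is on source and tag only (the communicator is omitted), the first `hw_capacity` entries reside in offload hardware, and any remaining entries spill to host memory. All names (`PostedRecv`, `match_depth`, `search_cost`) and the per-entry costs are hypothetical.

```python
from dataclasses import dataclass
from typing import List, Optional

# Stand-ins for MPI_ANY_SOURCE / MPI_ANY_TAG; the real values are
# implementation-defined, so these constants are assumptions.
ANY_SOURCE = -1
ANY_TAG = -1


@dataclass
class PostedRecv:
    source: int
    tag: int


def match_depth(queue: List[PostedRecv], source: int, tag: int) -> Optional[int]:
    """Return the 1-based position of the first posted receive that matches
    (source, tag), searching in posting order; None if nothing matches."""
    for depth, recv in enumerate(queue, start=1):
        source_ok = recv.source in (ANY_SOURCE, source)
        tag_ok = recv.tag in (ANY_TAG, tag)
        if source_ok and tag_ok:
            return depth
    return None


def search_cost(depth: int, hw_capacity: int,
                hw_cost: float, mem_cost: float) -> float:
    """Cost of examining `depth` entries when the first `hw_capacity`
    entries live in offload hardware and the rest live in host memory.
    Costs are arbitrary per-entry units (hypothetical)."""
    in_hw = min(depth, hw_capacity)
    in_mem = depth - in_hw
    return in_hw * hw_cost + in_mem * mem_cost


queue = [PostedRecv(3, 7), PostedRecv(ANY_SOURCE, 7), PostedRecv(5, ANY_TAG)]
depth = match_depth(queue, source=5, tag=9)  # only the third entry matches

# Hardware only 2x faster than memory: shrinking capacity costs little.
modest_small = search_cost(10, hw_capacity=4, hw_cost=1.0, mem_cost=2.0)
modest_large = search_cost(10, hw_capacity=16, hw_cost=1.0, mem_cost=2.0)

# Hardware 8x faster than memory: spilling to memory dominates the cost.
fast_small = search_cost(10, hw_capacity=4, hw_cost=0.25, mem_cost=2.0)
fast_large = search_cost(10, hw_capacity=16, hw_cost=0.25, mem_cost=2.0)
```

In this toy setting, a search of depth 10 costs 16.0 versus 10.0 units when the hardware is only twice as fast as memory, but 13.0 versus 2.5 units when it is eight times as fast. This mirrors the abstract's qualitative point that very fast matching hardware pays off only when searches complete within it.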
