HPC Process and Optimal Network Device Affinitization

Ravindra Babu Ganapathi;Aravind Gopalakrishnan;Russell W. McGuire
{"title":"HPC Process and Optimal Network Device Affinitization","authors":"Ravindra Babu Ganapathi;Aravind Gopalakrishnan;Russell W. McGuire","doi":"10.1109/TMSCS.2018.2871444","DOIUrl":null,"url":null,"abstract":"High Performance Computing (HPC) applications have demanding need for hardware resources such as processor, memory, and storage. Applications in the area of Artificial Intelligence and Machine Learning are taking center stage in HPC, which is driving demand for increasing compute resources per node which in turn is pushing bandwidth requirement between the compute nodes. New system design paradigms exist where deploying a system with more than one high performance IO device per node provides benefits. The number of I/O devices connected to the HPC node can be increased with PCIe switches and hence some of the HPC nodes are designed to include PCIe switches to provide a large number of PCIe slots. With multiple IO devices per node, application programmers are forced to consider HPC process affinity to not only compute resources but extend this to include IO devices. Mapping of process to processor cores and the closest IO device(s) increases complexity due to three way mapping and varying HPC node architectures. While operating systems perform reasonable mapping of process to processor core(s), they lack the application developer's knowledge of process workflow and optimal IO resource allocation when more than one IO device is attached to the compute node. This paper is an extended version of our work published in \n<xref>[1]</xref>\n . Our previous work provided solution for IO device affinity choices by abstracting the device selection algorithm from HPC applications. In this paper, we extend the affinity solution to enable OpenFabric Interfaces (OFI) which is a generic HPC API designed as part of the OpenFabrics Alliance that enables wider HPC programming models and applications supported by various HPC fabric vendors. We present a solution for IO device affinity choices by abstracting the device selection algorithm from HPC applications. MPI continues to be the dominant programming model for HPC and hence we provide evaluation with MPI based micro benchmarks. Our solution is then extended to OpenFabric Interfaces which supports other HPC programming models such as SHMEM, GASNet, and UPC. We propose a solution to solve NUMA issues at the lower level of the software stack that forms the runtime for MPI and other programming models independent of HPC applications. Our experiments are conducted on a two node system where each node consists of two socket Intel Xeon servers, attached with up to four Intel Omni-Path fabric devices connected over PCIe. The performance benefits seen by applications by affinitizing processes with best possible network device is evident from the results where we notice up to 40 percent improvement in uni-directional bandwidth, 48 percent bi-directional bandwidth, 32 percent improvement in latency measurements, and up to 40 percent improvement in message rate with OSU benchmark suite. We also extend our evaluation to include OFI operations and an MPI benchmark used for Genome assembly. With OFI Remote Memory Access (RMA) operations we see a bandwidth improvement of 32 percent for fi_read and 22 percent with fi_write operations, and also latency improvement of 15 percent for fi_read and 14 percent for fi_write. 
K-mer MMatching Interface HASH benchmark shows an improvement of up to 25 percent while using local network device versus using a network device connected to remote Xeon socket.","PeriodicalId":100643,"journal":{"name":"IEEE Transactions on Multi-Scale Computing Systems","volume":"4 4","pages":"749-757"},"PeriodicalIF":0.0000,"publicationDate":"2018-09-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1109/TMSCS.2018.2871444","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE Transactions on Multi-Scale Computing Systems","FirstCategoryId":"1085","ListUrlMain":"https://ieeexplore.ieee.org/document/8469016/","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

High Performance Computing (HPC) applications have demanding needs for hardware resources such as processors, memory, and storage. Applications in the areas of Artificial Intelligence and Machine Learning are taking center stage in HPC, driving demand for more compute resources per node, which in turn pushes up the bandwidth requirement between compute nodes. New system design paradigms exist in which deploying more than one high-performance I/O device per node provides benefits. The number of I/O devices connected to an HPC node can be increased with PCIe switches, and hence some HPC nodes are designed to include PCIe switches that provide a large number of PCIe slots. With multiple I/O devices per node, application programmers are forced to consider HPC process affinity not only to compute resources but also to I/O devices. Mapping a process to processor cores and the closest I/O device(s) increases complexity because of the three-way mapping and varying HPC node architectures. While operating systems perform reasonable mapping of processes to processor cores, they lack the application developer's knowledge of the process workflow and of optimal I/O resource allocation when more than one I/O device is attached to the compute node. This paper is an extended version of our work published in [1]. Our previous work provided a solution for I/O device affinity choices by abstracting the device selection algorithm away from HPC applications. In this paper, we extend the affinity solution to OpenFabrics Interfaces (OFI), a generic HPC API designed as part of the OpenFabrics Alliance that enables a wider range of HPC programming models and applications supported by various HPC fabric vendors. MPI continues to be the dominant programming model for HPC, and hence we provide an evaluation with MPI-based micro-benchmarks. Our solution is then extended to OFI, which supports other HPC programming models such as SHMEM, GASNet, and UPC. We propose to solve NUMA issues at the lower level of the software stack that forms the runtime for MPI and other programming models, independent of HPC applications. Our experiments are conducted on a two-node system where each node is a two-socket Intel Xeon server attached to up to four Intel Omni-Path fabric devices connected over PCIe. The performance benefit of affinitizing processes with the best possible network device is evident from the results: we observe up to 40 percent improvement in uni-directional bandwidth, 48 percent in bi-directional bandwidth, 32 percent in latency, and up to 40 percent in message rate with the OSU benchmark suite. We also extend our evaluation to include OFI operations and an MPI benchmark used for genome assembly. With OFI Remote Memory Access (RMA) operations we see bandwidth improvements of 32 percent for fi_read and 22 percent for fi_write, as well as latency improvements of 15 percent for fi_read and 14 percent for fi_write. A K-mer Matching Interface (KMI) hash benchmark shows an improvement of up to 25 percent when using the local network device versus a network device connected to the remote Xeon socket.
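The core idea, selecting the network device whose PCIe attachment point is local to the NUMA domain where a process runs, can be illustrated with a small standalone sketch. This is not the authors' implementation (which sits inside the OFI/MPI runtime); it is a minimal example that assumes hwloc 2.x is available and uses hypothetical PCIe bus IDs standing in for two Omni-Path HFIs, one per socket.

```c
/*
 * Sketch: decide whether a PCIe network device is local to the NUMA
 * domain the calling process is bound to, using hwloc.
 * Assumptions: hwloc >= 2.0; the bus IDs below are placeholders.
 */
#include <hwloc.h>
#include <stdio.h>

/* Returns 1 if the PCI device at 'busid' is attached under the same
 * package/NUMA locality as the CPUs the process is bound to. */
static int device_is_local(hwloc_topology_t topo, const char *busid,
                           hwloc_const_bitmap_t proc_cpuset)
{
    hwloc_obj_t pcidev = hwloc_get_pcidev_by_busidstring(topo, busid);
    if (!pcidev)
        return 0;
    /* Climb to the first non-I/O ancestor (package/group/machine);
     * that object carries the cpuset of the locally attached cores. */
    hwloc_obj_t anc = hwloc_get_non_io_ancestor_obj(topo, pcidev);
    return anc && hwloc_bitmap_intersects(anc->cpuset, proc_cpuset);
}

int main(void)
{
    /* Hypothetical bus IDs for two fabric devices, one per socket. */
    const char *candidates[] = { "0000:18:00.0", "0000:af:00.0" };
    hwloc_topology_t topo;

    hwloc_topology_init(&topo);
    /* PCI objects are filtered out by default; keep them visible. */
    hwloc_topology_set_io_types_filter(topo, HWLOC_TYPE_FILTER_KEEP_ALL);
    hwloc_topology_load(topo);

    /* Where is this process (e.g., an MPI rank) currently bound? */
    hwloc_bitmap_t cpuset = hwloc_bitmap_alloc();
    hwloc_get_cpubind(topo, cpuset, HWLOC_CPUBIND_PROCESS);

    for (unsigned i = 0; i < 2; i++)
        printf("%s is %s\n", candidates[i],
               device_is_local(topo, candidates[i], cpuset)
                   ? "local to this process" : "on the remote socket");

    hwloc_bitmap_free(cpuset);
    hwloc_topology_destroy(topo);
    return 0;
}
```

In a real runtime the result of such a locality query would steer device selection inside the fabric library rather than be printed; for instance, the PSM2 software used with Omni-Path can typically be directed to a specific HFI unit through its HFI_UNIT environment variable.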