Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing
Tal Ben-Nun, M. Sutton, Sreepathi Pai, K. Pingali
{"title":"Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing","authors":"Tal Ben-Nun, M. Sutton, Sreepathi Pai, K. Pingali","doi":"10.1145/3399730","DOIUrl":null,"url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 2329-4949/2020/06-ART18 $15.00 https://doi.org/10.1145/3399730 ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. 18:2 T. Ben-Nun et al. Fig. 1. Multi-GPU node schematics. via a low-latency, high-throughput bus (see Figure 1). These interconnects allow parallel applications to exchange data efficiently and to take advantage of the combined computational power and memory size of the GPUs, but may vary substantially between node types. Multi-GPU nodes are usually programmed using one of two methods. In the simple approach, each GPU is managed separately, using one process per device [19, 26]. Alternatively, a Bulk Synchronous Parallel (BSP) [42] programming model is used, in which applications are executed in rounds, and each round consists of local computation followed by global communication [6, 33]. The first approach is subject to overhead from various sources, such as the operating system, and requires a message-passing interface for communication. The BSP model, however, can introduce unnecessary serialization at the global barriers that implement round-based execution. Both programming methods may result in under-utilization of multi-GPU platforms, particularly for irregular applications, which may suffer from load imbalance and may have unpredictable communication patterns. In principle, asynchronous programming models can reduce some of those problems, because unlike in round-based communication, processors can compute and communicate autonomously without waiting for other processors to reach global barriers. However, there are few applications that exploit asynchronous execution, since their development requires an in-depth knowledge of the underlying architecture and communication network and involves performing intricate adaptations to the code. This article presents Groute, an asynchronous programming model and runtime environment [2] that can be used to develop a wide range of applications on multi-GPU systems. Based on concepts from low-level networking, Groute aims to overcome the programming complexity of asynchronous applications on multi-GPU and heterogeneous platforms. The communication constructs of Groute are simple, but they can be used to efficiently express programs that range from regular applications and BSP applications to nontrivial irregular algorithms. The asynchronous nature of the runtime environment also promotes load balancing, leading to better utilization of heterogeneous multi-GPU nodes. This article is an extended version of previously published work [7], where we explain the concepts in greater detail, consider newer multi-GPU topologies, and elaborate on the evaluated algorithms, as well as scalability considerations. The main contributions are the following: • We define abstract programming constructs for asynchronous execution and communication. • We show that these constructs can be used to define a variety of algorithms, including regular and irregular parallel algorithms. ACM Transactions on Parallel Computing, Vol. 7, No. 
3, Article 18. Publication date: June 2020. Groute: Asynchronous Multi-GPU Programming Model 18:3 • We compare aspects of the performance of our implementations, using applications written in existing frameworks as benchmarks. • We show that using Groute, it is possible to implement asynchronous applications that in most cases outperform state-of-the-art implementations, yielding up to 7.32× speedup on eight GPUs compared to a baseline execution on a single GPU. 2 MULTI-GPU NODE ARCHITECTURE In general, the role of accelerators is to complement the available CPUs by allowing them to offload data-parallel portions of an application. The CPUs, in turn, are responsible for process management, communication, input/output tasks, memory transfers, and data pre/post-processing. As illustrated in Figure 1, the CPUs and accelerators are connected to each other via a Front-Side Bus (FSB, implementations include QPI and HyperTransport). The FSB lanes, whose count is an indicator of the memory transfer bandwidth, are linked to an interconnect such as PCI-Express or NVLink that supports both CPU-GPU and GPU-GPU communications. Due to limitations in the hardware layout, such as use of the same motherboard and power supply units, multi-GPU nodes typically consist of ∼1–25 GPUs. The topology of the CPUs, GPUs, and interconnect can vary between complete all-pair connections and a hierarchical switched topology, as shown in the figure. In the tree-topology shown in Figure 1(a), each quadruplet of GPUs (i.e., 1–4 and 5–8) can perform direct communication operations amongst themselves, but communications with the other quadruplet are indirect and thus slower. For example, GPUs 1 and 4 can perform direct communication, but data transfers from GPU 4 to 5 must pass through the interconnect. A switched interface allows each CPU to communicate with all GPUs at the same rate. In other configurations, CPUs are directly connected to their quadruplet of GPUs, which results in variable CPU-GPU bandwidth, depending on process placement. The GPU architecture contains multiple memory copy engines, enabling simultaneous code execution and two-way (input/output) memory transfer. Below, we elaborate on the different ways concurrent copies can be used to efficiently communicate within a multi-GPU node. 2.1 Inter-GPU Communication Memory transfers among GPUs are provided by the vendor runtime via implicit and explicit interfaces. For the former, abstractions such as Unified and Managed Memory make use of virtual memory to perform copies, paging, and prefetching. With explicit copies, however, the user maintains full control over how and when memory is transferred. When exact memory access patterns are known, it is generally preferable to explicitly control memory movement, as prefetching may hurt memory-latency bound applications, for instance. For this reason, we focus below on explicit inter-GPU communication. Explicit memory transfers among GPUs can either be initiated by the host or a device. Hostinitiated memory transfer (Peer Transfer) is supported by explicit copy commands, whereas deviceinitiated memory transfer (Direct Access, DA) is implemented using inter-GPU memory accesses. Note that direct access to peer memory may not be available between all pairs of GPUs, depending on the bus topology. Access to pinned host memory, however, is possible from all GPUs. Device-initiated memory transfers are implemented by virtual addressing, which maps all host and device memory to a single address space. 
While more flexible than peer transfers, DA performance is highly sensitive to memory alignment, coalescing, number of active threads, and order of access. Using microbenchmarks (Figure 2), we measure 100 MB transfers, averaged over 100 trials, on theeight-GPU system from our experimental setup (see Section 5 for detailed specifications). ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. 18:4 T. Ben-Nun et al. Fig. 2. Inter-GPU memory transfer microbenchmarks. Figure 2(a) shows the transfer rate of device-initiated memory access on GPUs that reside in the same board, on different boards, and CPU-GPU communication. The figure demonstrates the two extremes of the DA spectrum—from tightly managed coalesced access (blue bars, left-hand side) to random, unmanaged access (red bars, right-hand side). Observe that coalesced access performs up to 21× better than random access. Also notice that the memory transfer rate correlates with the distance of the path in the topology. Due to the added level of dual-board GPUs (shown in Figure 1(a)), CPU-GPU transfer is faster than two different-board GPUs. To support device-initiated transfers between GPUs that cannot access each other’s memory, it is possible to perform a two-phase indirect copy. In indirect copy, the source GPU “pushes” information to host memory first, after which it is “pulled” by the destination GPU using host flags and system-wide memory fences for synchronization. In topologies such as the one presented in Figure 1(a), GPUs can only transmit to one destination at a time. This hinders the responsiveness of an asynchronous system, especially when transferring large buffers. One way to resolve this issue is by dividing messages into packets, as in networking. Figure 2(b) presents the overhead of using packetized memory transfers as opposed to a single peer transfer. The figure shows that the overhead decreases linearly as the packet size increases, ranging between ∼1% and 10% for 1–10 MB packets. This parameter can be tuned by individual applications to balance between latency and bandwidth. Figure 2(c) compares the transfer rate of direct (push) and indirect (push/pull) transfers, showing that packetized device-initiated transfers and the fine-grained control is advantageous, even over the host-managed packetized peer transfers. Note that, since device-initiated memory access is written in user code, it is possible to perform additional data processing during transfer. Another important aspect of multi-GPU communication is multiple source/destination transfers, as in collective operations. Due to the structure of the interconnect and memory copy engines, a naive application is likely to congest the bus. One approach, used in the NCCL library [31], creates a ring topology over the bus. In this approach, illustrated in Figure 3, each GPU transfers to ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. Groute: Asynchronous Multi-GPU Programming Model 18:5 Fig. 3. DA Ring topology. Fig. 4. Single GPU architecture. 
one destination, communicating via direct or i","PeriodicalId":0,"journal":{"name":"","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2020-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3399730","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7
Abstract
Fig. 1. Multi-GPU node schematics.

Within a multi-GPU node, the GPUs are connected to the host and to each other via a low-latency, high-throughput bus (see Figure 1). These interconnects allow parallel applications to exchange data efficiently and to take advantage of the combined computational power and memory size of the GPUs, but they may vary substantially between node types.

Multi-GPU nodes are usually programmed using one of two methods. In the simple approach, each GPU is managed separately, using one process per device [19, 26]. Alternatively, a Bulk Synchronous Parallel (BSP) [42] programming model is used, in which applications are executed in rounds, and each round consists of local computation followed by global communication [6, 33]. The first approach is subject to overhead from various sources, such as the operating system, and requires a message-passing interface for communication. The BSP model, however, can introduce unnecessary serialization at the global barriers that implement round-based execution. Both programming methods may result in under-utilization of multi-GPU platforms, particularly for irregular applications, which may suffer from load imbalance and can exhibit unpredictable communication patterns.

In principle, asynchronous programming models can reduce some of these problems because, unlike in round-based communication, processors can compute and communicate autonomously without waiting for other processors to reach global barriers. However, few applications exploit asynchronous execution, since their development requires in-depth knowledge of the underlying architecture and communication network and involves intricate adaptations to the code.

This article presents Groute, an asynchronous programming model and runtime environment [2] that can be used to develop a wide range of applications on multi-GPU systems. Based on concepts from low-level networking, Groute aims to overcome the programming complexity of asynchronous applications on multi-GPU and heterogeneous platforms. The communication constructs of Groute are simple, but they can be used to efficiently express programs that range from regular and BSP applications to nontrivial irregular algorithms. The asynchronous nature of the runtime environment also promotes load balancing, leading to better utilization of heterogeneous multi-GPU nodes.

This article is an extended version of previously published work [7], in which we explain the concepts in greater detail, consider newer multi-GPU topologies, and elaborate on the evaluated algorithms as well as on scalability considerations. The main contributions are the following:

• We define abstract programming constructs for asynchronous execution and communication.
• We show that these constructs can be used to define a variety of algorithms, including regular and irregular parallel algorithms.
• We compare aspects of the performance of our implementations, using applications written in existing frameworks as benchmarks.
• We show that using Groute, it is possible to implement asynchronous applications that in most cases outperform state-of-the-art implementations, yielding up to 7.32× speedup on eight GPUs compared to a baseline execution on a single GPU.

2 MULTI-GPU NODE ARCHITECTURE

In general, the role of accelerators is to complement the available CPUs by allowing them to offload data-parallel portions of an application. The CPUs, in turn, are responsible for process management, communication, input/output tasks, memory transfers, and data pre/post-processing. As illustrated in Figure 1, the CPUs and accelerators are connected to each other via a Front-Side Bus (FSB; implementations include QPI and HyperTransport). The FSB lanes, whose count is an indicator of the memory transfer bandwidth, are linked to an interconnect such as PCI-Express or NVLink that supports both CPU-GPU and GPU-GPU communication. Due to limitations in the hardware layout, such as the use of a shared motherboard and power supply units, multi-GPU nodes typically consist of ∼1–25 GPUs. The topology of the CPUs, GPUs, and interconnect can vary between complete all-pair connections and a hierarchical switched topology, as shown in the figure.

In the tree topology shown in Figure 1(a), each quadruplet of GPUs (i.e., GPUs 1–4 and 5–8) can perform direct communication operations amongst themselves, but communication with the other quadruplet is indirect and thus slower. For example, GPUs 1 and 4 can communicate directly, but data transfers from GPU 4 to GPU 5 must pass through the interconnect. A switched interface allows each CPU to communicate with all GPUs at the same rate. In other configurations, CPUs are directly connected to their quadruplet of GPUs, which results in variable CPU-GPU bandwidth, depending on process placement. The GPU architecture contains multiple memory copy engines, enabling simultaneous code execution and two-way (input/output) memory transfer. Below, we elaborate on the different ways concurrent copies can be used to communicate efficiently within a multi-GPU node.

2.1 Inter-GPU Communication

Memory transfers among GPUs are provided by the vendor runtime via implicit and explicit interfaces. For the former, abstractions such as Unified and Managed Memory make use of virtual memory to perform copies, paging, and prefetching. With explicit copies, however, the user maintains full control over how and when memory is transferred. When exact memory access patterns are known, it is generally preferable to control memory movement explicitly, since, for instance, prefetching may hurt memory-latency-bound applications. For this reason, we focus below on explicit inter-GPU communication.

Explicit memory transfers among GPUs can be initiated either by the host or by a device. Host-initiated memory transfer (Peer Transfer) is supported by explicit copy commands, whereas device-initiated memory transfer (Direct Access, DA) is implemented using inter-GPU memory accesses. Note that direct access to peer memory may not be available between all pairs of GPUs, depending on the bus topology; access to pinned host memory, however, is possible from all GPUs. Device-initiated memory transfers are implemented by virtual addressing, which maps all host and device memory to a single address space. While more flexible than peer transfers, DA performance is highly sensitive to memory alignment, coalescing, the number of active threads, and the order of access.
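To make the two explicit mechanisms concrete, the sketch below contrasts a host-initiated peer transfer (cudaMemcpyPeerAsync) with device-initiated direct access, where a kernel running on one GPU dereferences a pointer that physically resides on a peer GPU after peer access has been enabled. This is an illustrative CUDA sketch rather than Groute code; buffer sizes and launch parameters are placeholders, and error checking is omitted.

#include <cuda_runtime.h>
#include <cstdio>

// Device-initiated transfer (DA): a kernel running on the destination GPU
// "pulls" data directly from a buffer that physically resides on the source GPU.
__global__ void pull_kernel(const float* __restrict__ src_on_peer,
                            float* __restrict__ dst_local, size_t n) {
  for (size_t i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
       i += (size_t)gridDim.x * blockDim.x)
    dst_local[i] = src_on_peer[i];  // coalesced reads over the interconnect
}

int main() {
  const size_t n = 1 << 24;                 // placeholder size
  const size_t bytes = n * sizeof(float);
  float *src = nullptr, *dst = nullptr;

  cudaSetDevice(0);
  cudaMalloc(&src, bytes);
  cudaSetDevice(1);
  cudaMalloc(&dst, bytes);

  // Host-initiated peer transfer: the copy engines move the data, no kernel involved.
  cudaMemcpyPeerAsync(dst, /*dstDevice=*/1, src, /*srcDevice=*/0, bytes);
  cudaDeviceSynchronize();

  // Device-initiated direct access: only possible if the topology allows
  // peer access between the two GPUs.
  int can_access = 0;
  cudaDeviceCanAccessPeer(&can_access, /*device=*/1, /*peerDevice=*/0);
  if (can_access) {
    cudaSetDevice(1);
    cudaDeviceEnablePeerAccess(/*peerDevice=*/0, 0);  // map GPU 0 memory into GPU 1
    pull_kernel<<<256, 256>>>(src, dst, n);           // GPU 1 reads GPU 0 memory
    cudaDeviceSynchronize();
  } else {
    printf("No peer access between GPUs 0 and 1; fall back to indirect copies.\n");
  }
  return 0;
}

It is this kind of device-initiated access whose performance is measured in Figure 2(a); whether the grid-stride loop above produces coalesced or scattered reads determines which end of the spectrum it lands on.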
Using microbenchmarks (Figure 2), we measure 100 MB transfers, averaged over 100 trials, on the eight-GPU system from our experimental setup (see Section 5 for detailed specifications).

Fig. 2. Inter-GPU memory transfer microbenchmarks.

Figure 2(a) shows the transfer rate of device-initiated memory access for GPUs that reside on the same board, for GPUs on different boards, and for CPU-GPU communication. The figure demonstrates the two extremes of the DA spectrum: from tightly managed coalesced access (blue bars, left-hand side) to random, unmanaged access (red bars, right-hand side). Observe that coalesced access performs up to 21× better than random access. Also notice that the memory transfer rate correlates with the length of the path in the topology: due to the added level of dual-board GPUs (shown in Figure 1(a)), CPU-GPU transfers are faster than transfers between two GPUs on different boards.

To support device-initiated transfers between GPUs that cannot access each other's memory, it is possible to perform a two-phase indirect copy. In an indirect copy, the source GPU first "pushes" the information to host memory, after which it is "pulled" by the destination GPU, using host flags and system-wide memory fences for synchronization.

In topologies such as the one presented in Figure 1(a), GPUs can only transmit to one destination at a time. This hinders the responsiveness of an asynchronous system, especially when transferring large buffers. One way to resolve this issue is to divide messages into packets, as in networking. Figure 2(b) presents the overhead of using packetized memory transfers as opposed to a single peer transfer. The figure shows that the overhead decreases linearly as the packet size increases, ranging between ∼1% and 10% for 1–10 MB packets. This parameter can be tuned by individual applications to balance latency against bandwidth. Figure 2(c) compares the transfer rate of direct (push) and indirect (push/pull) transfers, showing that packetized device-initiated transfers and their fine-grained control are advantageous, even over host-managed packetized peer transfers. Note that, since device-initiated memory access is written in user code, it is possible to perform additional data processing during the transfer.

Another important aspect of multi-GPU communication is transfers with multiple sources or destinations, as in collective operations. Due to the structure of the interconnect and the memory copy engines, a naive application is likely to congest the bus. One approach, used in the NCCL library [31], creates a ring topology over the bus. In this approach, illustrated in Figure 3, each GPU transfers to one destination, communicating via direct or indirect access.

Fig. 3. DA Ring topology.
Fig. 4. Single GPU architecture.
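Returning to the packetization idea above, the following sketch (not the Groute runtime itself) splits one large host-managed peer transfer into fixed-size packets issued asynchronously on a stream; the packet size is the knob that trades latency against bandwidth. The function name and sizes are illustrative.

#include <cuda_runtime.h>
#include <algorithm>

// Sketch: copy `bytes` from GPU `src_dev` to GPU `dst_dev` as a sequence of
// packets instead of one monolithic peer transfer. Smaller packets let a
// receiver start consuming data earlier (lower latency) at the cost of some
// per-copy overhead (lower effective bandwidth).
void packetized_peer_copy(void* dst, int dst_dev,
                          const void* src, int src_dev,
                          size_t bytes, size_t packet_bytes,
                          cudaStream_t stream) {
  const char* s = static_cast<const char*>(src);
  char* d = static_cast<char*>(dst);
  for (size_t off = 0; off < bytes; off += packet_bytes) {
    size_t len = std::min(packet_bytes, bytes - off);
    // Each packet is an independent asynchronous copy; a consumer could be
    // notified per packet (e.g., with an event) instead of once at the end.
    cudaMemcpyPeerAsync(d + off, dst_dev, s + off, src_dev, len, stream);
  }
}

// Usage (assuming buffers were allocated on devices 0 and 1 elsewhere):
//   packetized_peer_copy(dst, 1, src, 0, 100 << 20, 4 << 20, stream);
//   cudaStreamSynchronize(stream);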
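Finally, to give a flavor of the ring-style collective that closes this excerpt, here is a hedged sketch of a pipelined chain broadcast: a buffer on GPU 0 is forwarded chunk by chunk along GPUs 0 → 1 → ... → N-1, so that different links of the bus carry different chunks concurrently. Unlike the device-initiated (DA) ring of Figure 3 and NCCL's actual implementation, this sketch uses host-managed peer copies ordered by cross-device events, purely for illustration.

#include <cuda_runtime.h>
#include <algorithm>
#include <vector>

// Pipelined chain broadcast sketch: GPU g forwards chunk c to GPU g+1 only
// after chunk c has arrived at GPU g. Copies for different chunks overlap on
// different links, keeping the bus busy without a global barrier.
void chain_broadcast(std::vector<char*>& buf,   // buf[g]: buffer on GPU g
                     size_t bytes, size_t chunk_bytes, int ngpus) {
  int nchunks = (int)((bytes + chunk_bytes - 1) / chunk_bytes);

  // One stream per forwarding link (GPU g -> GPU g+1), owned by GPU g.
  std::vector<cudaStream_t> stream(ngpus);
  // done[g][c] signals that GPU g has finished copying chunk c into GPU g+1.
  std::vector<std::vector<cudaEvent_t>> done(ngpus, std::vector<cudaEvent_t>(nchunks));
  for (int g = 0; g + 1 < ngpus; ++g) {
    cudaSetDevice(g);
    cudaStreamCreate(&stream[g]);
    for (int c = 0; c < nchunks; ++c)
      cudaEventCreateWithFlags(&done[g][c], cudaEventDisableTiming);
  }

  for (int c = 0; c < nchunks; ++c) {
    size_t off = (size_t)c * chunk_bytes;
    size_t len = std::min(chunk_bytes, bytes - off);
    for (int g = 0; g + 1 < ngpus; ++g) {
      cudaSetDevice(g);
      // GPU g may forward chunk c only after receiving it from GPU g-1.
      if (g > 0) cudaStreamWaitEvent(stream[g], done[g - 1][c], 0);
      cudaMemcpyPeerAsync(buf[g + 1] + off, g + 1, buf[g] + off, g, len, stream[g]);
      cudaEventRecord(done[g][c], stream[g]);
    }
  }

  for (int g = 0; g + 1 < ngpus; ++g) {
    cudaSetDevice(g);
    cudaStreamSynchronize(stream[g]);
  }
  // (Destruction of events and streams omitted for brevity.)
}

Each stream owns one forwarding link, and cudaStreamWaitEvent lets a stream on GPU g wait for an event recorded on GPU g-1; this expresses the per-chunk dependency between neighbors without a global barrier, which is the property the asynchronous setting relies on.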