Tal Ben-Nun, M. Sutton, Sreepathi Pai, K. Pingali
{"title":"异步多gpu编程模型与大规模图形处理的应用","authors":"Tal Ben-Nun, M. Sutton, Sreepathi Pai, K. Pingali","doi":"10.1145/3399730","DOIUrl":null,"url":null,"abstract":"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 2329-4949/2020/06-ART18 $15.00 https://doi.org/10.1145/3399730 ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. 18:2 T. Ben-Nun et al. Fig. 1. Multi-GPU node schematics. via a low-latency, high-throughput bus (see Figure 1). These interconnects allow parallel applications to exchange data efficiently and to take advantage of the combined computational power and memory size of the GPUs, but may vary substantially between node types. Multi-GPU nodes are usually programmed using one of two methods. In the simple approach, each GPU is managed separately, using one process per device [19, 26]. Alternatively, a Bulk Synchronous Parallel (BSP) [42] programming model is used, in which applications are executed in rounds, and each round consists of local computation followed by global communication [6, 33]. The first approach is subject to overhead from various sources, such as the operating system, and requires a message-passing interface for communication. The BSP model, however, can introduce unnecessary serialization at the global barriers that implement round-based execution. Both programming methods may result in under-utilization of multi-GPU platforms, particularly for irregular applications, which may suffer from load imbalance and may have unpredictable communication patterns. In principle, asynchronous programming models can reduce some of those problems, because unlike in round-based communication, processors can compute and communicate autonomously without waiting for other processors to reach global barriers. However, there are few applications that exploit asynchronous execution, since their development requires an in-depth knowledge of the underlying architecture and communication network and involves performing intricate adaptations to the code. This article presents Groute, an asynchronous programming model and runtime environment [2] that can be used to develop a wide range of applications on multi-GPU systems. Based on concepts from low-level networking, Groute aims to overcome the programming complexity of asynchronous applications on multi-GPU and heterogeneous platforms. The communication constructs of Groute are simple, but they can be used to efficiently express programs that range from regular applications and BSP applications to nontrivial irregular algorithms. The asynchronous nature of the runtime environment also promotes load balancing, leading to better utilization of heterogeneous multi-GPU nodes. This article is an extended version of previously published work [7], where we explain the concepts in greater detail, consider newer multi-GPU topologies, and elaborate on the evaluated algorithms, as well as scalability considerations. The main contributions are the following: • We define abstract programming constructs for asynchronous execution and communication. • We show that these constructs can be used to define a variety of algorithms, including regular and irregular parallel algorithms. ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. Groute: Asynchronous Multi-GPU Programming Model 18:3 • We compare aspects of the performance of our implementations, using applications written in existing frameworks as benchmarks. • We show that using Groute, it is possible to implement asynchronous applications that in most cases outperform state-of-the-art implementations, yielding up to 7.32× speedup on eight GPUs compared to a baseline execution on a single GPU. 2 MULTI-GPU NODE ARCHITECTURE In general, the role of accelerators is to complement the available CPUs by allowing them to offload data-parallel portions of an application. The CPUs, in turn, are responsible for process management, communication, input/output tasks, memory transfers, and data pre/post-processing. As illustrated in Figure 1, the CPUs and accelerators are connected to each other via a Front-Side Bus (FSB, implementations include QPI and HyperTransport). The FSB lanes, whose count is an indicator of the memory transfer bandwidth, are linked to an interconnect such as PCI-Express or NVLink that supports both CPU-GPU and GPU-GPU communications. Due to limitations in the hardware layout, such as use of the same motherboard and power supply units, multi-GPU nodes typically consist of ∼1–25 GPUs. The topology of the CPUs, GPUs, and interconnect can vary between complete all-pair connections and a hierarchical switched topology, as shown in the figure. In the tree-topology shown in Figure 1(a), each quadruplet of GPUs (i.e., 1–4 and 5–8) can perform direct communication operations amongst themselves, but communications with the other quadruplet are indirect and thus slower. For example, GPUs 1 and 4 can perform direct communication, but data transfers from GPU 4 to 5 must pass through the interconnect. A switched interface allows each CPU to communicate with all GPUs at the same rate. In other configurations, CPUs are directly connected to their quadruplet of GPUs, which results in variable CPU-GPU bandwidth, depending on process placement. The GPU architecture contains multiple memory copy engines, enabling simultaneous code execution and two-way (input/output) memory transfer. Below, we elaborate on the different ways concurrent copies can be used to efficiently communicate within a multi-GPU node. 2.1 Inter-GPU Communication Memory transfers among GPUs are provided by the vendor runtime via implicit and explicit interfaces. For the former, abstractions such as Unified and Managed Memory make use of virtual memory to perform copies, paging, and prefetching. With explicit copies, however, the user maintains full control over how and when memory is transferred. When exact memory access patterns are known, it is generally preferable to explicitly control memory movement, as prefetching may hurt memory-latency bound applications, for instance. For this reason, we focus below on explicit inter-GPU communication. Explicit memory transfers among GPUs can either be initiated by the host or a device. Hostinitiated memory transfer (Peer Transfer) is supported by explicit copy commands, whereas deviceinitiated memory transfer (Direct Access, DA) is implemented using inter-GPU memory accesses. Note that direct access to peer memory may not be available between all pairs of GPUs, depending on the bus topology. Access to pinned host memory, however, is possible from all GPUs. Device-initiated memory transfers are implemented by virtual addressing, which maps all host and device memory to a single address space. While more flexible than peer transfers, DA performance is highly sensitive to memory alignment, coalescing, number of active threads, and order of access. Using microbenchmarks (Figure 2), we measure 100 MB transfers, averaged over 100 trials, on theeight-GPU system from our experimental setup (see Section 5 for detailed specifications). ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. 18:4 T. Ben-Nun et al. Fig. 2. Inter-GPU memory transfer microbenchmarks. Figure 2(a) shows the transfer rate of device-initiated memory access on GPUs that reside in the same board, on different boards, and CPU-GPU communication. The figure demonstrates the two extremes of the DA spectrum—from tightly managed coalesced access (blue bars, left-hand side) to random, unmanaged access (red bars, right-hand side). Observe that coalesced access performs up to 21× better than random access. Also notice that the memory transfer rate correlates with the distance of the path in the topology. Due to the added level of dual-board GPUs (shown in Figure 1(a)), CPU-GPU transfer is faster than two different-board GPUs. To support device-initiated transfers between GPUs that cannot access each other’s memory, it is possible to perform a two-phase indirect copy. In indirect copy, the source GPU “pushes” information to host memory first, after which it is “pulled” by the destination GPU using host flags and system-wide memory fences for synchronization. In topologies such as the one presented in Figure 1(a), GPUs can only transmit to one destination at a time. This hinders the responsiveness of an asynchronous system, especially when transferring large buffers. One way to resolve this issue is by dividing messages into packets, as in networking. Figure 2(b) presents the overhead of using packetized memory transfers as opposed to a single peer transfer. The figure shows that the overhead decreases linearly as the packet size increases, ranging between ∼1% and 10% for 1–10 MB packets. This parameter can be tuned by individual applications to balance between latency and bandwidth. Figure 2(c) compares the transfer rate of direct (push) and indirect (push/pull) transfers, showing that packetized device-initiated transfers and the fine-grained control is advantageous, even over the host-managed packetized peer transfers. Note that, since device-initiated memory access is written in user code, it is possible to perform additional data processing during transfer. Another important aspect of multi-GPU communication is multiple source/destination transfers, as in collective operations. Due to the structure of the interconnect and memory copy engines, a naive application is likely to congest the bus. One approach, used in the NCCL library [31], creates a ring topology over the bus. In this approach, illustrated in Figure 3, each GPU transfers to ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. Groute: Asynchronous Multi-GPU Programming Model 18:5 Fig. 3. DA Ring topology. Fig. 4. Single GPU architecture. one destination, communicating via direct or i","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":"32 1","pages":"18:1-18:27"},"PeriodicalIF":0.9000,"publicationDate":"2020-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing\",\"authors\":\"Tal Ben-Nun, M. Sutton, Sreepathi Pai, K. Pingali\",\"doi\":\"10.1145/3399730\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 2329-4949/2020/06-ART18 $15.00 https://doi.org/10.1145/3399730 ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. 18:2 T. Ben-Nun et al. Fig. 1. Multi-GPU node schematics. via a low-latency, high-throughput bus (see Figure 1). These interconnects allow parallel applications to exchange data efficiently and to take advantage of the combined computational power and memory size of the GPUs, but may vary substantially between node types. Multi-GPU nodes are usually programmed using one of two methods. In the simple approach, each GPU is managed separately, using one process per device [19, 26]. Alternatively, a Bulk Synchronous Parallel (BSP) [42] programming model is used, in which applications are executed in rounds, and each round consists of local computation followed by global communication [6, 33]. The first approach is subject to overhead from various sources, such as the operating system, and requires a message-passing interface for communication. The BSP model, however, can introduce unnecessary serialization at the global barriers that implement round-based execution. Both programming methods may result in under-utilization of multi-GPU platforms, particularly for irregular applications, which may suffer from load imbalance and may have unpredictable communication patterns. In principle, asynchronous programming models can reduce some of those problems, because unlike in round-based communication, processors can compute and communicate autonomously without waiting for other processors to reach global barriers. However, there are few applications that exploit asynchronous execution, since their development requires an in-depth knowledge of the underlying architecture and communication network and involves performing intricate adaptations to the code. This article presents Groute, an asynchronous programming model and runtime environment [2] that can be used to develop a wide range of applications on multi-GPU systems. Based on concepts from low-level networking, Groute aims to overcome the programming complexity of asynchronous applications on multi-GPU and heterogeneous platforms. The communication constructs of Groute are simple, but they can be used to efficiently express programs that range from regular applications and BSP applications to nontrivial irregular algorithms. The asynchronous nature of the runtime environment also promotes load balancing, leading to better utilization of heterogeneous multi-GPU nodes. This article is an extended version of previously published work [7], where we explain the concepts in greater detail, consider newer multi-GPU topologies, and elaborate on the evaluated algorithms, as well as scalability considerations. The main contributions are the following: • We define abstract programming constructs for asynchronous execution and communication. • We show that these constructs can be used to define a variety of algorithms, including regular and irregular parallel algorithms. ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. Groute: Asynchronous Multi-GPU Programming Model 18:3 • We compare aspects of the performance of our implementations, using applications written in existing frameworks as benchmarks. • We show that using Groute, it is possible to implement asynchronous applications that in most cases outperform state-of-the-art implementations, yielding up to 7.32× speedup on eight GPUs compared to a baseline execution on a single GPU. 2 MULTI-GPU NODE ARCHITECTURE In general, the role of accelerators is to complement the available CPUs by allowing them to offload data-parallel portions of an application. The CPUs, in turn, are responsible for process management, communication, input/output tasks, memory transfers, and data pre/post-processing. As illustrated in Figure 1, the CPUs and accelerators are connected to each other via a Front-Side Bus (FSB, implementations include QPI and HyperTransport). The FSB lanes, whose count is an indicator of the memory transfer bandwidth, are linked to an interconnect such as PCI-Express or NVLink that supports both CPU-GPU and GPU-GPU communications. Due to limitations in the hardware layout, such as use of the same motherboard and power supply units, multi-GPU nodes typically consist of ∼1–25 GPUs. The topology of the CPUs, GPUs, and interconnect can vary between complete all-pair connections and a hierarchical switched topology, as shown in the figure. In the tree-topology shown in Figure 1(a), each quadruplet of GPUs (i.e., 1–4 and 5–8) can perform direct communication operations amongst themselves, but communications with the other quadruplet are indirect and thus slower. For example, GPUs 1 and 4 can perform direct communication, but data transfers from GPU 4 to 5 must pass through the interconnect. A switched interface allows each CPU to communicate with all GPUs at the same rate. In other configurations, CPUs are directly connected to their quadruplet of GPUs, which results in variable CPU-GPU bandwidth, depending on process placement. The GPU architecture contains multiple memory copy engines, enabling simultaneous code execution and two-way (input/output) memory transfer. Below, we elaborate on the different ways concurrent copies can be used to efficiently communicate within a multi-GPU node. 2.1 Inter-GPU Communication Memory transfers among GPUs are provided by the vendor runtime via implicit and explicit interfaces. For the former, abstractions such as Unified and Managed Memory make use of virtual memory to perform copies, paging, and prefetching. With explicit copies, however, the user maintains full control over how and when memory is transferred. When exact memory access patterns are known, it is generally preferable to explicitly control memory movement, as prefetching may hurt memory-latency bound applications, for instance. For this reason, we focus below on explicit inter-GPU communication. Explicit memory transfers among GPUs can either be initiated by the host or a device. Hostinitiated memory transfer (Peer Transfer) is supported by explicit copy commands, whereas deviceinitiated memory transfer (Direct Access, DA) is implemented using inter-GPU memory accesses. Note that direct access to peer memory may not be available between all pairs of GPUs, depending on the bus topology. Access to pinned host memory, however, is possible from all GPUs. Device-initiated memory transfers are implemented by virtual addressing, which maps all host and device memory to a single address space. While more flexible than peer transfers, DA performance is highly sensitive to memory alignment, coalescing, number of active threads, and order of access. Using microbenchmarks (Figure 2), we measure 100 MB transfers, averaged over 100 trials, on theeight-GPU system from our experimental setup (see Section 5 for detailed specifications). ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. 18:4 T. Ben-Nun et al. Fig. 2. Inter-GPU memory transfer microbenchmarks. Figure 2(a) shows the transfer rate of device-initiated memory access on GPUs that reside in the same board, on different boards, and CPU-GPU communication. The figure demonstrates the two extremes of the DA spectrum—from tightly managed coalesced access (blue bars, left-hand side) to random, unmanaged access (red bars, right-hand side). Observe that coalesced access performs up to 21× better than random access. Also notice that the memory transfer rate correlates with the distance of the path in the topology. Due to the added level of dual-board GPUs (shown in Figure 1(a)), CPU-GPU transfer is faster than two different-board GPUs. To support device-initiated transfers between GPUs that cannot access each other’s memory, it is possible to perform a two-phase indirect copy. In indirect copy, the source GPU “pushes” information to host memory first, after which it is “pulled” by the destination GPU using host flags and system-wide memory fences for synchronization. In topologies such as the one presented in Figure 1(a), GPUs can only transmit to one destination at a time. This hinders the responsiveness of an asynchronous system, especially when transferring large buffers. One way to resolve this issue is by dividing messages into packets, as in networking. Figure 2(b) presents the overhead of using packetized memory transfers as opposed to a single peer transfer. The figure shows that the overhead decreases linearly as the packet size increases, ranging between ∼1% and 10% for 1–10 MB packets. This parameter can be tuned by individual applications to balance between latency and bandwidth. Figure 2(c) compares the transfer rate of direct (push) and indirect (push/pull) transfers, showing that packetized device-initiated transfers and the fine-grained control is advantageous, even over the host-managed packetized peer transfers. Note that, since device-initiated memory access is written in user code, it is possible to perform additional data processing during transfer. Another important aspect of multi-GPU communication is multiple source/destination transfers, as in collective operations. Due to the structure of the interconnect and memory copy engines, a naive application is likely to congest the bus. One approach, used in the NCCL library [31], creates a ring topology over the bus. In this approach, illustrated in Figure 3, each GPU transfers to ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. Groute: Asynchronous Multi-GPU Programming Model 18:5 Fig. 3. DA Ring topology. Fig. 4. Single GPU architecture. one destination, communicating via direct or i\",\"PeriodicalId\":42115,\"journal\":{\"name\":\"ACM Transactions on Parallel Computing\",\"volume\":\"32 1\",\"pages\":\"18:1-18:27\"},\"PeriodicalIF\":0.9000,\"publicationDate\":\"2020-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"ACM Transactions on Parallel Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3399730\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"COMPUTER SCIENCE, THEORY & METHODS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Parallel Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3399730","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"COMPUTER SCIENCE, THEORY & METHODS","Score":null,"Total":0}
引用次数: 7
Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing
ing with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org. © 2020 Association for Computing Machinery. 2329-4949/2020/06-ART18 $15.00 https://doi.org/10.1145/3399730 ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. 18:2 T. Ben-Nun et al. Fig. 1. Multi-GPU node schematics. via a low-latency, high-throughput bus (see Figure 1). These interconnects allow parallel applications to exchange data efficiently and to take advantage of the combined computational power and memory size of the GPUs, but may vary substantially between node types. Multi-GPU nodes are usually programmed using one of two methods. In the simple approach, each GPU is managed separately, using one process per device [19, 26]. Alternatively, a Bulk Synchronous Parallel (BSP) [42] programming model is used, in which applications are executed in rounds, and each round consists of local computation followed by global communication [6, 33]. The first approach is subject to overhead from various sources, such as the operating system, and requires a message-passing interface for communication. The BSP model, however, can introduce unnecessary serialization at the global barriers that implement round-based execution. Both programming methods may result in under-utilization of multi-GPU platforms, particularly for irregular applications, which may suffer from load imbalance and may have unpredictable communication patterns. In principle, asynchronous programming models can reduce some of those problems, because unlike in round-based communication, processors can compute and communicate autonomously without waiting for other processors to reach global barriers. However, there are few applications that exploit asynchronous execution, since their development requires an in-depth knowledge of the underlying architecture and communication network and involves performing intricate adaptations to the code. This article presents Groute, an asynchronous programming model and runtime environment [2] that can be used to develop a wide range of applications on multi-GPU systems. Based on concepts from low-level networking, Groute aims to overcome the programming complexity of asynchronous applications on multi-GPU and heterogeneous platforms. The communication constructs of Groute are simple, but they can be used to efficiently express programs that range from regular applications and BSP applications to nontrivial irregular algorithms. The asynchronous nature of the runtime environment also promotes load balancing, leading to better utilization of heterogeneous multi-GPU nodes. This article is an extended version of previously published work [7], where we explain the concepts in greater detail, consider newer multi-GPU topologies, and elaborate on the evaluated algorithms, as well as scalability considerations. The main contributions are the following: • We define abstract programming constructs for asynchronous execution and communication. • We show that these constructs can be used to define a variety of algorithms, including regular and irregular parallel algorithms. ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. Groute: Asynchronous Multi-GPU Programming Model 18:3 • We compare aspects of the performance of our implementations, using applications written in existing frameworks as benchmarks. • We show that using Groute, it is possible to implement asynchronous applications that in most cases outperform state-of-the-art implementations, yielding up to 7.32× speedup on eight GPUs compared to a baseline execution on a single GPU. 2 MULTI-GPU NODE ARCHITECTURE In general, the role of accelerators is to complement the available CPUs by allowing them to offload data-parallel portions of an application. The CPUs, in turn, are responsible for process management, communication, input/output tasks, memory transfers, and data pre/post-processing. As illustrated in Figure 1, the CPUs and accelerators are connected to each other via a Front-Side Bus (FSB, implementations include QPI and HyperTransport). The FSB lanes, whose count is an indicator of the memory transfer bandwidth, are linked to an interconnect such as PCI-Express or NVLink that supports both CPU-GPU and GPU-GPU communications. Due to limitations in the hardware layout, such as use of the same motherboard and power supply units, multi-GPU nodes typically consist of ∼1–25 GPUs. The topology of the CPUs, GPUs, and interconnect can vary between complete all-pair connections and a hierarchical switched topology, as shown in the figure. In the tree-topology shown in Figure 1(a), each quadruplet of GPUs (i.e., 1–4 and 5–8) can perform direct communication operations amongst themselves, but communications with the other quadruplet are indirect and thus slower. For example, GPUs 1 and 4 can perform direct communication, but data transfers from GPU 4 to 5 must pass through the interconnect. A switched interface allows each CPU to communicate with all GPUs at the same rate. In other configurations, CPUs are directly connected to their quadruplet of GPUs, which results in variable CPU-GPU bandwidth, depending on process placement. The GPU architecture contains multiple memory copy engines, enabling simultaneous code execution and two-way (input/output) memory transfer. Below, we elaborate on the different ways concurrent copies can be used to efficiently communicate within a multi-GPU node. 2.1 Inter-GPU Communication Memory transfers among GPUs are provided by the vendor runtime via implicit and explicit interfaces. For the former, abstractions such as Unified and Managed Memory make use of virtual memory to perform copies, paging, and prefetching. With explicit copies, however, the user maintains full control over how and when memory is transferred. When exact memory access patterns are known, it is generally preferable to explicitly control memory movement, as prefetching may hurt memory-latency bound applications, for instance. For this reason, we focus below on explicit inter-GPU communication. Explicit memory transfers among GPUs can either be initiated by the host or a device. Hostinitiated memory transfer (Peer Transfer) is supported by explicit copy commands, whereas deviceinitiated memory transfer (Direct Access, DA) is implemented using inter-GPU memory accesses. Note that direct access to peer memory may not be available between all pairs of GPUs, depending on the bus topology. Access to pinned host memory, however, is possible from all GPUs. Device-initiated memory transfers are implemented by virtual addressing, which maps all host and device memory to a single address space. While more flexible than peer transfers, DA performance is highly sensitive to memory alignment, coalescing, number of active threads, and order of access. Using microbenchmarks (Figure 2), we measure 100 MB transfers, averaged over 100 trials, on theeight-GPU system from our experimental setup (see Section 5 for detailed specifications). ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. 18:4 T. Ben-Nun et al. Fig. 2. Inter-GPU memory transfer microbenchmarks. Figure 2(a) shows the transfer rate of device-initiated memory access on GPUs that reside in the same board, on different boards, and CPU-GPU communication. The figure demonstrates the two extremes of the DA spectrum—from tightly managed coalesced access (blue bars, left-hand side) to random, unmanaged access (red bars, right-hand side). Observe that coalesced access performs up to 21× better than random access. Also notice that the memory transfer rate correlates with the distance of the path in the topology. Due to the added level of dual-board GPUs (shown in Figure 1(a)), CPU-GPU transfer is faster than two different-board GPUs. To support device-initiated transfers between GPUs that cannot access each other’s memory, it is possible to perform a two-phase indirect copy. In indirect copy, the source GPU “pushes” information to host memory first, after which it is “pulled” by the destination GPU using host flags and system-wide memory fences for synchronization. In topologies such as the one presented in Figure 1(a), GPUs can only transmit to one destination at a time. This hinders the responsiveness of an asynchronous system, especially when transferring large buffers. One way to resolve this issue is by dividing messages into packets, as in networking. Figure 2(b) presents the overhead of using packetized memory transfers as opposed to a single peer transfer. The figure shows that the overhead decreases linearly as the packet size increases, ranging between ∼1% and 10% for 1–10 MB packets. This parameter can be tuned by individual applications to balance between latency and bandwidth. Figure 2(c) compares the transfer rate of direct (push) and indirect (push/pull) transfers, showing that packetized device-initiated transfers and the fine-grained control is advantageous, even over the host-managed packetized peer transfers. Note that, since device-initiated memory access is written in user code, it is possible to perform additional data processing during transfer. Another important aspect of multi-GPU communication is multiple source/destination transfers, as in collective operations. Due to the structure of the interconnect and memory copy engines, a naive application is likely to congest the bus. One approach, used in the NCCL library [31], creates a ring topology over the bus. In this approach, illustrated in Figure 3, each GPU transfers to ACM Transactions on Parallel Computing, Vol. 7, No. 3, Article 18. Publication date: June 2020. Groute: Asynchronous Multi-GPU Programming Model 18:5 Fig. 3. DA Ring topology. Fig. 4. Single GPU architecture. one destination, communicating via direct or i