Scalability Limitations of Processing-in-Memory using Real System Evaluations

Proc. ACM Meas. Anal. Comput. Syst., pp. 5:1-5:28. Pub Date: 2024-02-16. DOI: 10.1145/3639046

Gilbert Jonatan, Haeyoon Cho, Hyojun Son, Xiangyu Wu, Neal Livesay, Evelio Mora, Kaustubh Shivdikar, José L. Abellán, Ajay Joshi, David Kaeli, John Kim


Citations: 2

Abstract

Processing-in-memory (PIM), where compute is moved closer to the memory or the data, has been widely explored to accelerate emerging workloads. Recently, memory vendors have announced different PIM-based systems to minimize data movement and improve performance as well as energy efficiency. One critical component of PIM is the large amount of compute parallelism provided across many PIM "nodes", the compute units near the memory. In this work, we provide an extensive evaluation and analysis of real PIM systems based on UPMEM PIM. We show that while PIM has benefits, it also faces scalability challenges and limitations as the number of PIM nodes increases. In particular, we show how collective communications that are commonly found in many kernels/workloads can be problematic for PIM systems. To evaluate the impact of collective communication in PIM architectures, we provide an in-depth analysis of two workloads on the UPMEM PIM system that use representative collective communication patterns: AllReduce and All-to-All. Specifically, we evaluate 1) embedding tables, which are commonly used in recommendation systems and require AllReduce, and 2) the Number Theoretic Transform (NTT) kernel, which is a critical component of Fully Homomorphic Encryption (FHE) and requires All-to-All communication. We analyze the performance benefits of these workloads and show how they can be efficiently mapped to the PIM architecture through alternative data partitioning. However, since each PIM compute unit can only access its local memory, any communication between PIM nodes (or any access to remote data) must go through the host CPU, which severely hampers application performance. To increase the scalability (or applicability) of PIM to future workloads, we make the case that future PIM architectures need efficient communication or interconnection networks between the PIM nodes, which require both hardware and software support.
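
The host-mediated communication described in the abstract can be illustrated with a small simulation. The following is a minimal sketch in plain Python, not the authors' code and not the UPMEM SDK: the function names `embedding_partials` and `host_mediated_allreduce`, the round-robin row partitioning, and the element-count traffic metric are all assumptions made for illustration. Each simulated PIM node sums the embedding rows it owns, and the host then gathers, reduces, and broadcasts the partial sums.

```python
from typing import List, Tuple


def embedding_partials(table: List[List[int]], indices: List[int],
                       n_nodes: int) -> List[List[int]]:
    """Simulate row-partitioned embedding lookup with sum pooling.

    Assumption: row i lives on node (i % n_nodes). Each node can only
    read its own rows, so it produces a partial sum over the rows it owns.
    """
    dim = len(table[0])
    partials = [[0] * dim for _ in range(n_nodes)]
    for idx in indices:
        node = idx % n_nodes          # owner of this row (hypothetical mapping)
        for j in range(dim):
            partials[node][j] += table[idx][j]
    return partials


def host_mediated_allreduce(partials: List[List[int]]) -> Tuple[List[List[int]], int]:
    """AllReduce (sum) where every transfer passes through the host.

    Returns the per-node copies of the reduced vector and the number of
    vector elements that crossed the host link in either direction.
    """
    n_nodes = len(partials)
    width = len(partials[0])

    # Step 1: node -> host. The host gathers every partial vector.
    gathered = [list(p) for p in partials]
    traffic = n_nodes * width

    # Step 2: the host CPU performs the reduction.
    total = [sum(column) for column in zip(*gathered)]

    # Step 3: host -> node. The host broadcasts the result to every node.
    results = [list(total) for _ in range(n_nodes)]
    traffic += n_nodes * width

    return results, traffic


if __name__ == "__main__":
    table = [[i, 2 * i] for i in range(8)]   # toy 8 x 2 embedding table
    partials = embedding_partials(table, [1, 3, 6], n_nodes=4)
    results, traffic = host_mediated_allreduce(partials)
    print(results[0], traffic)
```

In this toy model the host-link traffic is `2 * n_nodes * width` elements, so it grows linearly with the number of nodes even though the useful result has a fixed size. This is one simple way to picture why collectives routed through the host can limit scaling.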
