Proc. ACM Meas. Anal. Comput. Syst.: Latest Articles

Scalability Limitations of Processing-in-Memory using Real System Evaluations
Proc. ACM Meas. Anal. Comput. Syst. Pub Date : 2024-02-16 DOI: 10.1145/3639046
Gilbert Jonatan, Haeyoon Cho, Hyojun Son, Xiangyu Wu, Neal Livesay, Evelio Mora, Kaustubh Shivdikar, José L. Abellán, Ajay Joshi, David Kaeli, John Kim
{"title":"Scalability Limitations of Processing-in-Memory using Real System Evaluations","authors":"Gilbert Jonatan, Haeyoon Cho, Hyojun Son, Xiangyu Wu, Neal Livesay, Evelio Mora, Kaustubh Shivdikar, José L. Abellán, Ajay Joshi, David Kaeli, John Kim","doi":"10.1145/3639046","DOIUrl":"https://doi.org/10.1145/3639046","url":null,"abstract":"Processing-in-memory (PIM), where the compute is moved closer to the memory or the data, has been widely explored to accelerate emerging workloads. Recently, different PIM-based systems have been announced by memory vendors to minimize data movement and improve performance as well as energy efficiency. One critical component of PIM is the large amount of compute parallelism provided across many PIM \"nodes'' or the compute units near the memory. In this work, we provide an extensive evaluation and analysis of real PIM systems based on UPMEM PIM. We show that while there are benefits of PIM, there are also scalability challenges and limitations as the number of PIM nodes increases. In particular, we show how collective communications that are commonly found in many kernels/workloads can be problematic for PIM systems. To evaluate the impact of collective communication in PIM architectures, we provide an in-depth analysis of two workloads on the UPMEM PIM system that utilize representative common collective communication patterns -- AllReduce and All-to-All communication. Specifically, we evaluate 1) embedding tables that are commonly used in recommendation systems that require AllReduce and 2) the Number Theoretic Transform (NTT) kernel which is a critical component of Fully Homomorphic Encryption (FHE) that requires All-to-All communication. We analyze the performance benefits of these workloads and show how they can be efficiently mapped to the PIM architecture through alternative data partitioning. However, since each PIM compute unit can only access its local memory, when communication is necessary between PIM nodes (or remote data is needed), communication between the compute units must be done through the host CPU, thereby severely hampering application performance. To increase the scalability (or applicability) of PIM to future workloads, we make the case for how future PIM architectures need efficient communication or interconnection networks between the PIM nodes that require both hardware and software support.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"292 2","pages":"5:1-5:28"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140453838","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
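The scalability argument above hinges on the fact that any exchange between PIM nodes must be routed through the host CPU. A minimal back-of-the-envelope sketch, with made-up bandwidth and throughput constants (not UPMEM measurements), shows why such a host-mediated AllReduce keeps growing with the node count while the per-node compute share shrinks:

```python
# Hypothetical cost model for a host-mediated AllReduce across PIM nodes.
# Bandwidth and throughput constants are illustrative assumptions only.

def allreduce_via_host_seconds(num_nodes: int,
                               vector_bytes: int,
                               host_link_gbps: float = 10.0,
                               host_reduce_gbps: float = 50.0) -> float:
    """Every node ships its partial vector to the host, the host reduces,
    then broadcasts the result back, so traffic scales with num_nodes."""
    gather = num_nodes * vector_bytes / (host_link_gbps * 1e9 / 8)
    reduce_time = num_nodes * vector_bytes / (host_reduce_gbps * 1e9 / 8)
    broadcast = num_nodes * vector_bytes / (host_link_gbps * 1e9 / 8)
    return gather + reduce_time + broadcast

def local_compute_seconds(total_work_flop: float, num_nodes: int,
                          node_gflops: float = 1.0) -> float:
    """Embarrassingly parallel part: work per node shrinks as nodes are added."""
    return (total_work_flop / num_nodes) / (node_gflops * 1e9)

if __name__ == "__main__":
    for n in (64, 256, 1024, 2048):
        comm = allreduce_via_host_seconds(n, vector_bytes=4 * 1_000_000)
        comp = local_compute_seconds(total_work_flop=1e12, num_nodes=n)
        print(f"{n:5d} nodes: compute {comp * 1e3:8.2f} ms, "
              f"host-mediated AllReduce {comm * 1e3:8.2f} ms")
```

With these toy numbers, adding nodes keeps cutting compute time while the AllReduce term keeps growing, which is the scaling pattern the paper attributes to the lack of an interconnect between PIM nodes.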
SCADA World: An Exploration of the Diversity in Power Grid Networks
Proc. ACM Meas. Anal. Comput. Syst. Pub Date : 2024-02-16 DOI: 10.1145/3639036
Neil Ortiz Silva, Alvaro A. Cárdenas, A. Wool
{"title":"SCADA World: An Exploration of the Diversity in Power Grid Networks","authors":"Neil Ortiz Silva, Alvaro A. Cárdenas, A. Wool","doi":"10.1145/3639036","DOIUrl":"https://doi.org/10.1145/3639036","url":null,"abstract":"Despite a growing interest in understanding the industrial control networks that monitor and control our critical infrastructures (such as the power grid), to date, SCADA networks have been analyzed in isolation from each other. They have been treated as monolithic networks without taking into consideration their differences. In this paper, we analyze real-world data from different parts of a power grid (generation, transmission, distribution, and end-consumer) and show that these industrial networks exhibit a variety of unique behaviors and configurations that have not been documented before. To the best of our knowledge, our study is the first to tackle the analysis of power grid networks at this level. Our results help us dispel several misconceptions proposed by previous work, and we also provide new insights into the differences and types of SCADA networks.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"444 1","pages":"10:1-10:32"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140453926","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Deep Dive into NTP Pool's Popularity and Mapping
Proc. ACM Meas. Anal. Comput. Syst. Pub Date : 2024-02-16 DOI: 10.1145/3639041
G. Moura, Marco Davids, C. Schutijser, Cristian Hesselman, John Heidemann, Georgios Smaragdakis
{"title":"Deep Dive into NTP Pool's Popularity and Mapping","authors":"G. Moura, Marco Davids, C. Schutijser, Cristian Hesselman, John Heidemann, Georgios Smaragdakis","doi":"10.1145/3639041","DOIUrl":"https://doi.org/10.1145/3639041","url":null,"abstract":"Time synchronization is of paramount importance on the Internet, with the Network Time Protocol (NTP) serving as the primary synchronization protocol. The NTP Pool, a volunteer-driven initiative launched two decades ago, facilitates connections between clients and NTP servers. Our analysis of root DNS queries reveals that the NTP Pool has consistently been the most popular time service. We further investigate the DNS component (GeoDNS) of the NTP Pool, which is responsible for mapping clients to servers. Our findings indicate that the current algorithm is heavily skewed, leading to the emergence of time monopolies for entire countries. For instance, clients in the US are served by 551 NTP servers, while clients in Cameroon and Nigeria are served by only one and two servers, respectively, out of the 4k+ servers available in the NTP Pool. We examine the underlying assumption behind GeoDNS for these mappings and discover that time servers located far away can still provide accurate clock time information to clients. We have shared our findings with the NTP Pool operators, who acknowledge them and plan to revise their algorithm to enhance security.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"193 1","pages":"15:1-15:30"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140454257","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
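The finding that far-away servers can still deliver accurate time is easy to probe with a few NTP queries. The sketch below assumes the third-party ntplib package and uses public NTP Pool zone hostnames purely as examples; it is a minimal illustration of measuring per-server offset and delay, not the authors' measurement setup.

```python
# Minimal probe of NTP offset and delay from a handful of servers.
# Requires the third-party "ntplib" package (pip install ntplib).
# Hostnames are illustrative zone names from the public NTP Pool.
import ntplib

SERVERS = ["0.pool.ntp.org", "0.europe.pool.ntp.org", "0.asia.pool.ntp.org"]

def probe(host: str) -> None:
    client = ntplib.NTPClient()
    try:
        resp = client.request(host, version=3, timeout=5)
    except Exception as exc:  # unreachable server, timeout, etc.
        print(f"{host:28s} error: {exc}")
        return
    # offset: estimated local clock error vs. the server; delay: round-trip time.
    print(f"{host:28s} offset {resp.offset * 1e3:8.2f} ms   "
          f"delay {resp.delay * 1e3:8.2f} ms")

if __name__ == "__main__":
    for host in SERVERS:
        probe(host)
```

Comparing offset (estimated clock error) against delay (round-trip time) across zones gives a rough sense of how much geographic distance actually matters for accuracy.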
POMACS V8, N1, March 2024 Editorial
Proc. ACM Meas. Anal. Comput. Syst. Pub Date : 2024-02-16 DOI: 10.1145/3639027
F. Ciucu, Giulia Fanti, Rhonda Righter
{"title":"POMACS V8, N1, March 2024 Editorial","authors":"F. Ciucu, Giulia Fanti, Rhonda Righter","doi":"10.1145/3639027","DOIUrl":"https://doi.org/10.1145/3639027","url":null,"abstract":"The Proceedings of the ACM on Measurement and Analysis of Computing Systems (POMACS) focuses on the measurement and performance evaluation of computer systems and operates in close collaboration with the ACM Special Interest Group SIGMETRICS. All papers in this issue of POMACS will be presented at the ACM SIGMETRICS/Performance 2024 conference on June 10-14, 2024, in Venice, Italy. These papers have been selected during the Fall submission round by the 93 members of the ACM SIGMETRICS/Performance 2024 program committee via a rigorous review process. Each paper was conditionally accepted (and shepherded), allowed a \"one-shot\" revision (to be resubmitted to one of the subsequent two SIGMETRICS/Performance deadlines), or rejected (with re-submission allowed after a year). For this issue, which represents the Fall deadline, POMACS is publishing 18 papers out of 118 submissions, of which 6 had previously received a one-shot revision decision. All submissions received at least 3 reviews and borderline cases were extensively discussed during the online program committee meeting. Based on the indicated track(s), roughly 33% of the submissions were in the Theory track, 47% were in the Measurement & Applied Modeling track, 39% were in the Systems track, and 19% were in the Learning track (papers could be part of more than one track). Many individuals contributed to the success of this issue of POMACS. First, we would like to thank the authors, who submitted their best work to SIGMETRICS/Performance/POMACS. Second, we would like to thank the program committee members who provided constructive feedback in their reviews to authors and participated in the online discussions and program committee meeting. We also thank the several external reviewers who provided their expert opinions on specific submissions that required additional input. We are also grateful to the SIGMETRICS Board Chair, Mor Harchol-Balter, the IFIP Working Group 7.3 Chair, Mark S. Squillante, the previous SIGMETRICS Board Chair, Giuliano Casale, and the past program committee Chairs, Konstantin Avratchenkov, Phillipa Gill, and Bhuvan Urgaonkar, who provided a wealth of information and guidance. Finally, we are grateful to the Organization Committee and to the SIGMETRICS Board for their ongoing efforts and initiatives for creating an exciting program for ACM SIGMETRICS/Performance 2024.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"93 6","pages":"1:1"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140454897","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Fair Resource Allocation in Virtualized O-RAN Platforms
Proc. ACM Meas. Anal. Comput. Syst. Pub Date : 2024-02-16 DOI: 10.48550/arXiv.2402.11285
Fatih Aslan, G. Iosifidis, J. Ayala-Romero, Andres Garcia-Saavedra, Xavier Pérez Costa
{"title":"Fair Resource Allocation in Virtualized O-RAN Platforms","authors":"Fatih Aslan, G. Iosifidis, J. Ayala-Romero, Andres Garcia-Saavedra, Xavier Pérez Costa","doi":"10.48550/arXiv.2402.11285","DOIUrl":"https://doi.org/10.48550/arXiv.2402.11285","url":null,"abstract":"O-RAN systems and their deployment in virtualized general-purpose computing platforms (O-Cloud) constitute a paradigm shift expected to bring unprecedented performance gains. However, these architectures raise new implementation challenges and threaten to worsen the already-high energy consumption of mobile networks. This paper presents first a series of experiments which assess the O-Cloud's energy costs and their dependency on the servers' hardware, capacity and data traffic properties which, typically, change over time. Next, it proposes a compute policy for assigning the base station data loads to O-Cloud servers in an energy-efficient fashion; and a radio policy that determines at near-real-time the minimum transmission block size for each user so as to avoid unnecessary energy costs. The policies balance energy savings with performance, and ensure that both of them are dispersed fairly across the servers and users, respectively. To cater for the unknown and time-varying parameters affecting the policies, we develop a novel online learning framework with fairness guarantees that apply to the entire operation horizon of the system (long-term fairness). The policies are evaluated using trace-driven simulations and are fully implemented in an O-RAN compatible system where we measure the energy costs and throughput in realistic scenarios.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"70 7","pages":"17:1-17:34"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140455016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
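As an illustration of what an energy-aware compute policy has to trade off, the toy sketch below greedily places base-station loads on the server with the smallest marginal energy increase. It is a hypothetical simplification with made-up capacities and power numbers; the paper's actual policies are learned online and carry long-term fairness guarantees, which this greedy rule does not.

```python
# Toy greedy assignment of base-station loads to O-Cloud servers.
# Energy model and constants are illustrative assumptions, not the paper's policy.
from dataclasses import dataclass

@dataclass
class Server:
    name: str
    capacity: float        # maximum normalized processing load
    idle_power: float      # watts when powered on with no load
    watts_per_load: float  # incremental watts per unit of load
    load: float = 0.0

    def marginal_energy(self, extra: float) -> float:
        turn_on = self.idle_power if self.load == 0 else 0.0
        return turn_on + extra * self.watts_per_load

def assign(loads, servers):
    """Place each load on the feasible server with the smallest marginal energy."""
    plan = []
    for demand in sorted(loads, reverse=True):   # largest loads first
        feasible = [s for s in servers if s.load + demand <= s.capacity]
        best = min(feasible, key=lambda s: s.marginal_energy(demand))
        best.load += demand
        plan.append((demand, best.name))
    return plan

if __name__ == "__main__":
    servers = [Server("srv-a", 1.0, 90.0, 60.0), Server("srv-b", 1.0, 50.0, 110.0)]
    print(assign([0.4, 0.3, 0.2, 0.1], servers))
```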
Who's Got My Back? Measuring the Adoption of an Internet-wide BGP RTBH Service
Proc. ACM Meas. Anal. Comput. Syst. Pub Date : 2024-02-16 DOI: 10.1145/3639029
Radu Anghel, Yury Zhauniarovich, C. Gañán
{"title":"Who's Got My Back? Measuring the Adoption of an Internet-wide BGP RTBH Service","authors":"Radu Anghel, Yury Zhauniarovich, C. Gañán","doi":"10.1145/3639029","DOIUrl":"https://doi.org/10.1145/3639029","url":null,"abstract":"Distributed Denial-of-Service (DDoS) attacks continue to threaten the availability of Internet-based services. While countermeasures exist to decrease the impact of these attacks, not all operators have the resources or knowledge to deploy them. Alternatively, anti-DDoS services such as DDoS clearing houses and blackholing have emerged. Unwanted Traffic Removal Service (UTRS), being one of the oldest community-based anti-DDoS services, has become a global free collaborative service that aims at mitigating major DDoS attacks through the Border Gateway Protocol (BGP). Once the BGP session with UTRS is established, UTRS members can advertise part of the prefixes belonging to their AS to UTRS. UTRS will forward them to all other participants, who, in turn, should start blocking traffic to the advertised IP addresses. In this paper, we develop and evaluate a methodology to automatically detect UTRS participation in the wild. To this end, we deploy a measurement infrastructure and devise a methodology to detect UTRS-based traffic blocking. Using this methodology, we conducted a longitudinal analysis of UTRS participants over ten weeks. Our results show that at any point in time, there were 562 participants, including multihomed, stub, transit, and IXP ASes. Moreover, we surveyed 245 network operators to understand why they would (not) join UTRS. Results show that threat and coping appraisal significantly influence the intention to participate in UTRS.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"56 9","pages":"3:1-3:25"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140455110","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
Shrinking VOD Traffic via Rényi-Entropic Optimal Transport
Proc. ACM Meas. Anal. Comput. Syst. Pub Date : 2024-02-16 DOI: 10.1145/3639033
Chi-Jen (Roger) Lo, Mahesh K. Marina, N. Sastry, Kai Xu, Saeed Fadaei, Yong Li
{"title":"Shrinking VOD Traffic via Rényi-Entropic Optimal Transport","authors":"Chi-Jen (Roger) Lo, Mahesh K. Marina, N. Sastry, Kai Xu, Saeed Fadaei, Yong Li","doi":"10.1145/3639033","DOIUrl":"https://doi.org/10.1145/3639033","url":null,"abstract":"In response to the exponential surge in Internet Video on Demand (VOD) traffic, numerous research endeavors have concentrated on optimizing and enhancing infrastructure efficiency. In contrast, this paper explores whether users' demand patterns can be shaped to reduce the pressure on infrastructure. Our main idea is to design a mechanism that alters the distribution of user requests to another distribution which is much more cache-efficient, but still remains 'close enough' (in the sense of cost) to fulfil each individual user's preference. To quantify the cache footprint of VOD traffic, we propose a novel application of Rényi entropy as its proxy, capturing the 'richness' (the number of distinct videos or cache size) and the 'evenness' (the relative popularity of video accesses) of the on-demand video distribution. We then demonstrate how to decrease this metric by formulating a problem drawing on the mathematical theory of optimal transport (OT). Additionally, we establish a key equivalence theorem: minimizing Rényi entropy corresponds to maximizing soft cache hit ratio (SCHR) --- a variant of cache hit ratio allowing similarity-based video substitutions. Evaluation on a real-world, city-scale video viewing dataset reveals a remarkable 83% reduction in cache size (associated with VOD caching traffic). Crucially, in alignment with the above-mentioned equivalence theorem, our approach yields a significant uplift to SCHR, achieving close to 100%.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"333 1","pages":"7:1-7:34"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140453783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
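The Rényi-entropy proxy is easy to make concrete. For a popularity distribution p, the entropy of order alpha is H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha); a uniform distribution maximizes it at log(catalogue size), and skewing popularity drives it down. The sketch below evaluates it on synthetic Zipf-like distributions, which are an assumption for illustration and not the paper's dataset.

```python
# Rényi entropy of a video-popularity distribution, as a proxy for cache footprint.
# The distributions are synthetic Zipf-like examples, not the paper's data.
import numpy as np

def renyi_entropy(p: np.ndarray, alpha: float) -> float:
    """H_alpha(p) = log(sum_i p_i^alpha) / (1 - alpha), in nats."""
    p = p[p > 0]
    if abs(alpha - 1.0) < 1e-9:                 # limit case: Shannon entropy
        return float(-(p * np.log(p)).sum())
    return float(np.log((p ** alpha).sum()) / (1.0 - alpha))

def zipf(n: int, s: float) -> np.ndarray:
    w = 1.0 / np.arange(1, n + 1) ** s
    return w / w.sum()

if __name__ == "__main__":
    catalogue = 10_000
    for s in (0.0, 0.8, 1.2):                   # s=0.0 is uniform; larger s is more skewed
        p = zipf(catalogue, s)
        print(f"skew s={s:3.1f}: "
              f"H_0.5={renyi_entropy(p, 0.5):6.2f}  "
              f"H_1={renyi_entropy(p, 1.0):6.2f}  "
              f"H_2={renyi_entropy(p, 2.0):6.2f} nats")
```

Since exp(H_alpha) can be read as an effective catalogue size, lowering the Rényi entropy of the request distribution translates directly into a smaller cache footprint, which is the lever the paper's optimal-transport formulation pulls.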
H3DM: A High-bandwidth High-capacity Hybrid 3D Memory Design for GPUs
Proc. ACM Meas. Anal. Comput. Syst. Pub Date : 2024-02-16 DOI: 10.1145/3639038
N. Akbarzadeh, Sina Darabi, A. Gheibi-Fetrat, Amir Mirzaei, Mohammad Sadrosadati, H. Sarbazi-Azad
{"title":"H3DM: A High-bandwidth High-capacity Hybrid 3D Memory Design for GPUs","authors":"N. Akbarzadeh, Sina Darabi, A. Gheibi-Fetrat, Amir Mirzaei, Mohammad Sadrosadati, H. Sarbazi-Azad","doi":"10.1145/3639038","DOIUrl":"https://doi.org/10.1145/3639038","url":null,"abstract":"Graphics Processing Units (GPUs) are widely used for modern applications with huge data sizes. However, the performance benefit of GPUs is limited by their memory capacity and bandwidth. Although GPU vendors improve memory capacity and bandwidth using 3D memory technology (HBM), many important workloads with terabytes of data still cannot fit in the provided capacity and are bound by the provided bandwidth. With a limited GPU memory capacity, programmers should handle the data movement between GPU and host memories by themselves, causing a significant programming burden. To improve programming ease, GPUs use a unified address space with the host that allows over-subscribing GPU memory, but this approach is not effective in terms of performance once GPUs encounter memory page faults. Many recent works have tried to remedy capacity and bandwidth bottlenecks using dense non-volatile memories (NVMs) and true-3D stacking. However, these works mainly focus on one bottleneck or do not provide a scalable solution that fits future requirements. In this paper, we investigate true-3D stacking of dense, low-power, and refresh-free non-volatile phase change memory (PCM) on top of state-of-the-art GPU configurations to provide higher capacity and bandwidth within the available area and power budget. The higher density and lower power consumption of PCM provide higher capacity through integrating more cells in each 3D layer and enabling stacking more layers. However, we observe that stacking more than six layers of pure-PCM memory violates the thermal constraint and severely harms the performance and power efficiency due to its higher write latency and energy. Further, it degrades the lifetime of GPU to less than one year. Utilizing a hybrid architecture that leverages the benefits of both DRAM and PCM memories has been widely studied by prior proposals; however, true-3D integration of such a hybrid memory architecture especially on top of state-of-the-art powerful GPU architecture has not been investigated yet. We experimentally demonstrate that by covering 80% of write requests in DRAM and eliminating refresh overhead, true-3D stacking of eight 32GB layers of PCM along with two 8GB layers of DRAM is possible resulting in a total of 272GB memory capacity. Based on the explored design requirements, We propose a 3D high-bandwidth high-capacity hybrid memory (H3DM) system utilizing a hybrid-3D (H3D)-aware remapping scheme to reduce expensive PCM writes to under 20% while avoiding DRAM refresh overhead. H3DM improves the performance up to 291% compared to the baseline GPU architecture while remaining within only 3% of an ideal case with DRAM-like access latency, on average. Moreover, by increasing the dataset size above the baseline GPU memory space, H3DM improves performance and power up to 648% and 87% compared to the baseline GPU architecture since it avoids expensive data transfers through off-chip communication links.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. 
Syst.","volume":"593 ","pages":"12:1-12:28"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140454162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
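A rough way to see why a thin DRAM layer can absorb most writes is to assume a skewed (Zipf-like) write popularity and remap the hottest pages into DRAM. The sketch below does exactly that with assumed skews and the roughly 6% DRAM-to-total capacity ratio implied by the 2x8GB DRAM / 8x32GB PCM layout in the abstract; it is not H3DM's remapping scheme.

```python
# Toy estimate of the fraction of writes absorbed by a small DRAM layer when the
# hottest pages are remapped into it. Skew values and the page count are assumptions.
import numpy as np

def write_coverage(num_pages: int, dram_pages: int, skew: float) -> float:
    """Fraction of writes hitting DRAM if the dram_pages most-written pages live there."""
    w = 1.0 / np.arange(1, num_pages + 1) ** skew   # Zipf-like write popularity
    w /= w.sum()
    return float(w[:dram_pages].sum())               # pages sorted hottest-first

if __name__ == "__main__":
    total_pages = 1_000_000
    dram_pages = int(total_pages * 16 / 272)         # ~6% of capacity is DRAM
    for skew in (0.6, 0.9, 1.1):
        cov = write_coverage(total_pages, dram_pages, skew)
        print(f"zipf skew {skew}: DRAM absorbs {cov:.0%} of writes")
```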
StarShip: Mitigating I/O Bottlenecks in Serverless Computing for Scientific Workflows
Proc. ACM Meas. Anal. Comput. Syst. Pub Date : 2024-02-16 DOI: 10.1145/3639028
Rohan Basu Roy, Devesh Tiwari
{"title":"StarShip: Mitigating I/O Bottlenecks in Serverless Computing for Scientific Workflows","authors":"Rohan Basu Roy, Devesh Tiwari","doi":"10.1145/3639028","DOIUrl":"https://doi.org/10.1145/3639028","url":null,"abstract":"This work highlights the significance of I/O bottlenecks that data-intensive HPC workflows face in serverless environments - an issue that has been largely overlooked by prior works. To address this challenge, we propose a novel framework, StarShip, which effectively addresses I/O bottlenecks for HPC workflows executing in serverless environments by leveraging different storage options and multi-tier functions, co-optimizing for service time and service cost. StarShip exploits the Levenberg-Marquardt optimization method to find an effective solution in a large, complex search space. StarShip achieves significantly better performance and cost compared to competing techniques, improving service time by 45% and service cost by 37.6% on average over state-of-the-art solutions.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"492 4","pages":"2:1-2:29"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140454031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
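Purely as an illustration of co-optimizing service time and cost with Levenberg-Marquardt (the optimizer StarShip relies on), the sketch below runs SciPy's LM solver on a continuous relaxation of two hypothetical knobs: the fraction of data on fast storage and the fraction of functions on a premium tier. The time and cost models and all constants are made up; this is not StarShip's formulation.

```python
# Illustrative Levenberg-Marquardt co-optimization of service time and service cost.
# The two knobs, the linear models, and the weights are assumptions for illustration.
import numpy as np
from scipy.optimize import least_squares

def sigmoid(x):
    # Map unconstrained variables to fractions in (0, 1), since method="lm" has no bounds.
    return 1.0 / (1.0 + np.exp(-x))

def residuals(x, w_time=1.0, w_cost=0.5):
    fast_storage, premium_fn = sigmoid(x)
    service_time = 10.0 - 6.0 * fast_storage - 3.0 * premium_fn   # seconds (toy model)
    service_cost = 1.0 + 4.0 * fast_storage + 2.0 * premium_fn    # dollars (toy model)
    return [w_time * service_time, w_cost * service_cost]

if __name__ == "__main__":
    sol = least_squares(residuals, x0=[0.0, 0.0], method="lm")
    frac = sigmoid(sol.x)
    print(f"fast-storage fraction {frac[0]:.2f}, premium-function fraction {frac[1]:.2f}")
```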
Thorough Characterization and Analysis of Large Transformer Model Training At-Scale
Proc. ACM Meas. Anal. Comput. Syst. Pub Date : 2024-02-16 DOI: 10.1145/3639034
Scott Cheng, Jun-Liang Lin, M. Emani, Siddhisanket Raskar, Sam Foreman, Zhen Xie, Venkat Vishwanath, M. Kandemir
{"title":"Thorough Characterization and Analysis of Large Transformer Model Training At-Scale","authors":"Scott Cheng, Jun-Liang Lin, M. Emani, Siddhisanket Raskar, Sam Foreman, Zhen Xie, Venkat Vishwanath, M. Kandemir","doi":"10.1145/3639034","DOIUrl":"https://doi.org/10.1145/3639034","url":null,"abstract":"Large transformer models have recently achieved great success across various domains. With a growing number of model parameters, a large transformer model training today typically involves model sharding, data parallelism, and model parallelism. Thus, the throughput of large-scale model training depends heavily on the network bandwidth since a combination of model sharding and multiple parallelism strategies incurs various costs. However, prior characterizations of transformer models on high-bandwidth DGX machines that use TFLOPS as a metric may not reflect the performance of a system with lower bandwidth. Furthermore, data and model parallelism reveal significantly distinct training profiles on different system bandwidths at scale and, thus, need a thorough study. In this paper, we provide a bottom-up breakdown of training throughput into compute and communication time, and quantitatively analyze their respective influences on overall end-to-end training scaling. Our evaluation involves an in-depth exploration of data parallelism, scaling up to 512 GPUs with limited bandwidth, and examines three model sharding strategies among six model sizes. We also evaluate three combinations of model parallelism on both high and low bandwidth supercomputing systems. Overall, our work provides a broader perspective on large-scale transformer model training, and our analysis and evaluation yield practical insights for predicting training scaling, shaping the future development of supercomputing system design.","PeriodicalId":335883,"journal":{"name":"Proc. ACM Meas. Anal. Comput. Syst.","volume":"358 1","pages":"8:1-8:25"},"PeriodicalIF":0.0,"publicationDate":"2024-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140454467","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
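The compute-versus-communication breakdown the paper performs bottom-up can be caricatured with a simple model: per-GPU compute time from a FLOP budget and GPU throughput, plus a ring-AllReduce term for the gradients under data parallelism. All constants below are assumptions for illustration, not measurements from the paper.

```python
# Back-of-the-envelope breakdown of a data-parallel training step into compute time
# and gradient-AllReduce communication time. All constants are illustrative assumptions.

def step_time_seconds(params_b: float, flops_per_gpu_step: float, num_gpus: int,
                      gpu_tflops: float, link_gbps: float) -> tuple[float, float]:
    compute = flops_per_gpu_step / (gpu_tflops * 1e12)
    grad_bytes = params_b * 1e9 * 2                      # fp16 gradients
    # A ring AllReduce moves roughly 2*(N-1)/N of the gradient volume per GPU.
    comm = 2 * (num_gpus - 1) / num_gpus * grad_bytes / (link_gbps * 1e9 / 8)
    return compute, comm

if __name__ == "__main__":
    for link in (800.0, 100.0):                          # high vs. low bandwidth (Gb/s)
        for gpus in (8, 64, 512):
            comp, comm = step_time_seconds(params_b=20, flops_per_gpu_step=5e15,
                                           num_gpus=gpus, gpu_tflops=150, link_gbps=link)
            print(f"{gpus:4d} GPUs @ {link:5.0f} Gb/s: "
                  f"compute {comp:5.2f} s, allreduce {comm:5.2f} s")
```

Even in this caricature, the communication term is nearly flat in GPU count but inversely proportional to link bandwidth, which is why TFLOPS measured on high-bandwidth DGX machines can overstate what a lower-bandwidth system will deliver at scale.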