{"title":"Pure: Evolving Message Passing To Better Leverage Shared Memory Within Nodes","authors":"James Psota, Armando Solar-Lezama","doi":"10.1145/3627535.3638503","DOIUrl":"https://doi.org/10.1145/3627535.3638503","url":null,"abstract":"","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"40 16","pages":"133-146"},"PeriodicalIF":0.0,"publicationDate":"2024-02-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140449099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Are Your Epochs Too Epic? Batch Free Can Be Harmful","authors":"Daewoo Kim, T. Brown, Ajay Singh","doi":"10.1145/3627535.3638491","DOIUrl":"https://doi.org/10.1145/3627535.3638491","url":null,"abstract":"Epoch based memory reclamation (EBR) is one of the most popular techniques for reclaiming memory in lock-free and optimistic locking data structures, due to its ease of use and good performance in practice. However, EBR is known to be sensitive to thread delays, which can result in performance degradation. Moreover, the exact mechanism for this performance degradation is not well understood. This paper illustrates this performance degradation in a popular data structure benchmark, and does a deep dive to uncover its root cause-a subtle interaction between EBR and state of the art memory allocators. In essence, modern allocators attempt to reduce the overhead of freeing by maintaining bounded thread caches of objects for local reuse, actually freeing them (a very high latency operation) only when thread caches become too large. EBR immediately bypasses these mechanisms whenever a particularly large batch of objects is freed, substantially increasing overheads and latencies. Beyond EBR, many memory reclamation algorithms, and data structures, that reclaim objects in large batches suffer similar deleterious interactions with popular allocators. We propose a simple algorithmic fix for such algorithms to amortize the freeing of large object batches over time, and apply this technique to ten existing memory reclamation algorithms, observing performance improvements for nine out of ten, and over 50% improvement for six out of ten in experiments on a high performance lock-free ABtree. We also present an extremely simple token passing variant of EBR and show that, with our fix, it performs 1.5-2.6x faster than the fastest known memory reclamation algorithm, and 1.2-1.5x faster than not reclaiming at all, on a 192 thread four socket Intel system.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"67 2","pages":"30-41"},"PeriodicalIF":0.0,"publicationDate":"2024-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140501649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Kronecker Matrix-Matrix Multiplication on GPUs","authors":"Abhinav Jangda, Mohit Yadav","doi":"10.1145/3627535.3638489","DOIUrl":"https://doi.org/10.1145/3627535.3638489","url":null,"abstract":"Kronecker Matrix-Matrix Multiplication (Kron-Matmul) is the multiplication of a matrix with the Kronecker Product of several smaller matrices. Kron-Matmul is a core operation for many scientific and machine learning computations. State-of-the-art Kron-Matmul implementations utilize existing tensor algebra operations, such as matrix multiplication, transpose, and tensor matrix multiplication. However, this design choice prevents several Kron-Matmul specific optimizations, thus, leaving significant performance on the table. To address this issue, we present FastKron, an efficient technique for Kron-Matmul on single and multiple GPUs. FastKron is independent of linear algebra operations enabling several new optimizations for Kron-Matmul. Thus, it performs up to 40.7x and 7.85x faster than existing implementations on 1 and 16 GPUs respectively.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"242 9","pages":"390-403"},"PeriodicalIF":0.0,"publicationDate":"2024-01-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140504868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Coarse grain parallelization of deep neural networks","authors":"Marc González","doi":"10.1145/3016078.2851158","DOIUrl":"https://doi.org/10.1145/3016078.2851158","url":null,"abstract":"Deep neural networks (DNN) have recently achieved extraordinary results in domains like computer vision and speech recognition. An essential element for this success has been the introduction of hi...","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2016-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129771506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A tool to analyze the performance of multithreaded programs on NUMA architectures","authors":"Xu Liu, J. Mellor-Crummey","doi":"10.1145/2555243.2555271","DOIUrl":"https://doi.org/10.1145/2555243.2555271","url":null,"abstract":"Almost all of today's microprocessors contain memory controllers and directly attach to memory. Modern multiprocessor systems support non-uniform memory access (NUMA): it is faster for a microprocessor to access memory that is directly attached than it is to access memory attached to another processor. Without careful distribution of computation and data, a multithreaded program running on such a system may have high average memory access latency. To use multiprocessor systems efficiently, programmers need performance tools to guide the design of NUMA-aware codes. To address this need, we enhanced the HPCToolkit performance tools to support measurement and analysis of performance problems on multiprocessor systems with multiple NUMA domains. With these extensions, HPCToolkit helps pinpoint, quantify, and analyze NUMA bottlenecks in executions of multithreaded programs. It computes derived metrics to assess the severity of bottlenecks, analyzes memory accesses, and provides a wealth of information to guide NUMA optimization, including information about how to distribute data to reduce access latency and minimize contention. This paper describes the design and implementation of our extensions to HPCToolkit. We demonstrate their utility by describing case studies in which we use these capabilities to diagnose NUMA bottlenecks in four multithreaded applications.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115308648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Race directed scheduling of concurrent programs","authors":"Mahdi Eslamimehr, J. Palsberg","doi":"10.1145/2555243.2555263","DOIUrl":"https://doi.org/10.1145/2555243.2555263","url":null,"abstract":"Detection of data races in Java programs remains a difficult problem. The best static techniques produce many false positives, and also the best dynamic techniques leave room for improvement. We present a new technique called race directed scheduling that for a given race candidate searches for an input and a schedule that lead to the race. The search iterates a combination of concolic execution and schedule improvement, and turns out to find useful inputs and schedules efficiently. We use an existing technique to produce a manageable number of race candidates. Our experiments on 23 Java programs found 72 real races that were missed by the best existing dynamic techniques. Among those 72 races, 31 races were found with schedules that have between 1 million and 108 million events, which suggests that they are rare and hard-to-find races.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"201 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122757806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Extracting logical structure and identifying stragglers in parallel execution traces","authors":"Katherine E. Isaacs, T. Gamblin, A. Bhatele, P. Bremer, M. Schulz, B. Hamann","doi":"10.1145/2555243.2555288","DOIUrl":"https://doi.org/10.1145/2555243.2555288","url":null,"abstract":"We introduce a new approach to automatically extract an idealized logical structure from a parallel execution trace. We use this structure to define intuitive metrics such as the lateness of a process involved in a parallel execution. By analyzing and illustrating traces in terms of logical steps, we leverage a developer's understanding of the happened-before relations in a parallel program. This technique can uncover dependency chains, elucidate communication patterns, and highlight sources and propagation of delays, all of which may be obscured in a traditional trace visualization.","PeriodicalId":286119,"journal":{"name":"ACM SIGPLAN Symposium on Principles & Practice of Parallel Programming","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122064933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}