2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) — Latest Publications

Why Globally Re-shuffle? Revisiting Data Shuffling in Large Scale Deep Learning
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00109
Thao Nguyen, François Trahay, Jens Domke, Aleksandr Drozd, Emil Vatai, Jianwei Liao, M. Wahib, Balazs Gerofi
Abstract: Stochastic gradient descent (SGD) is the most prevalent algorithm for training Deep Neural Networks (DNNs). SGD iterates over the input data set in each training epoch, processing data samples in a random-access fashion. Because this puts enormous pressure on the I/O subsystem, the most common approach to distributed SGD in HPC environments is to replicate the entire dataset to node-local SSDs. However, due to rapidly growing data set sizes, this approach has become increasingly infeasible. Surprisingly, the questions of why and to what extent random access is required have received little empirical attention in the literature. In this paper, we revisit data shuffling in DL workloads to investigate the viability of partitioning the dataset among workers and performing only a partial distributed exchange of samples in each training epoch. Through extensive experiments on up to 2,048 GPUs of ABCI and 4,096 compute nodes of Fugaku, we demonstrate that in practice the validation accuracy of global shuffling can be maintained when the partial distributed exchange is carefully tuned. We provide a solution implemented in PyTorch that enables users to control the proposed data exchange scheme.
Citations: 14
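The paper's own solution is a PyTorch extension; purely to illustrate the partial-exchange idea, the following C++ sketch partitions a sample index space among workers and, each epoch, swaps a tunable fraction of samples between statically paired partitions before shuffling locally. The exchange fraction and the fixed pairing are invented for illustration (a real implementation would vary the pairing and exchange over the network), not the paper's tuned policy.

```cpp
#include <algorithm>
#include <iostream>
#include <numeric>
#include <random>
#include <vector>

// Partial distributed exchange, single-process simulation: each "worker"
// holds a partition of sample indices; per epoch, a fraction of samples is
// swapped between paired workers, then each partition is shuffled locally,
// so random access stays within a partition instead of the whole dataset.
int main() {
    const int num_workers = 4, samples_per_worker = 8;
    const double exchange_frac = 0.25;  // fraction swapped per epoch (assumed)
    std::mt19937 rng(42);

    std::vector<std::vector<int>> part(num_workers,
                                       std::vector<int>(samples_per_worker));
    for (int w = 0; w < num_workers; ++w)
        std::iota(part[w].begin(), part[w].end(), w * samples_per_worker);

    for (int epoch = 0; epoch < 3; ++epoch) {
        // Pair workers (0,1), (2,3), ... and swap the leading fraction.
        int k = static_cast<int>(exchange_frac * samples_per_worker);
        for (int w = 0; w + 1 < num_workers; w += 2)
            for (int i = 0; i < k; ++i)
                std::swap(part[w][i], part[w + 1][i]);
        // Local shuffle only: no global re-shuffle of the full dataset.
        for (auto& p : part) std::shuffle(p.begin(), p.end(), rng);

        std::cout << "epoch " << epoch << ", worker 0:";
        for (int s : part[0]) std::cout << ' ' << s;
        std::cout << '\n';
    }
}
```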
Co-Designing an OpenMP GPU Runtime and Optimizations for Near-Zero Overhead Execution
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00055
J. Doerfert, Atmn Patel, Joseph Huber, Shilei Tian, J. M. Diaz, Barbara M. Chapman, G. Georgakoudis
Abstract: GPU accelerators are ubiquitous in modern HPC systems. To program them, users can choose between vendor-specific native programming models such as CUDA, which provide simple parallelism semantics with minimal runtime support, and portable alternatives such as OpenMP, which offer rich parallel semantics and feature an extensive runtime library to support execution. While the operations of such a runtime can easily limit performance and drain resources, this was to some degree regarded as an unavoidable overhead. In this work we present a co-design methodology for optimizing applications using a specifically crafted OpenMP GPU runtime such that most use cases incur near-zero overhead. Specifically, our approach exposes runtime semantics and state to the compiler, so that optimizations can effectively eliminate abstractions and runtime state from the final binary. With the help of user-provided assumptions we can further optimize common patterns that would otherwise increase resource consumption. We evaluated our prototype, built on top of the LLVM/OpenMP GPU offloading infrastructure, with multiple HPC proxy applications and benchmarks. A comparison of CUDA, the original OpenMP runtime, and our co-designed alternative shows that our approach significantly improves performance and lowers resource consumption. Oftentimes we can closely match the CUDA implementation without sacrificing the versatility and portability of OpenMP.
Citations: 8
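For context, below is a standard OpenMP offload kernel (SAXPY) of the kind whose runtime bookkeeping the co-designed runtime aims to make near-free. The directives are plain OpenMP 4.5+ offloading; the comments paraphrase the paper's goal rather than demonstrating its compiler passes.

```cpp
#include <cstdio>
#include <vector>

// A standard OpenMP offload kernel. With a conventional GPU runtime,
// entering the target region drags in runtime calls and state machines;
// the paper's co-designed runtime exposes that state to the compiler so it
// can be folded away, approaching a hand-written CUDA kernel.
int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), y(n, 2.0f);
    const float a = 0.5f;
    float* xp = x.data();
    float* yp = y.data();

#pragma omp target teams distribute parallel for \
    map(to : xp[0:n]) map(tofrom : yp[0:n])
    for (int i = 0; i < n; ++i)
        yp[i] += a * xp[i];

    std::printf("y[0] = %f\n", y[0]);  // expect 2.5
}
```

Compiled without offloading support, the pragma is ignored and the loop runs on the host, so the snippet is runnable either way.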
Memory-Aware Scheduling of Tasks Sharing Data on Multiple GPUs with Dynamic Runtime Systems
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00073
Maxime Gonthier, L. Marchal, Samuel Thibault
Abstract: The use of accelerators such as GPUs has become mainstream for achieving high performance on modern computing systems. GPUs come with their own (limited) memory and are connected to the main memory of the machine through a bus (with limited bandwidth). When a computation is started on a GPU, the corresponding data needs to be transferred to the GPU before the computation starts. Such data movements may become a performance bottleneck, especially when several GPUs have to share the communication bus. Task-based runtime schedulers have emerged as a convenient and efficient way to use such heterogeneous platforms. When processing an application, the scheduler has knowledge of all tasks available for processing on a GPU, as well as their input data dependencies. Hence, it is able to choose which task to allocate to which GPU and to reorder tasks so as to minimize data movements. We focus on this problem of partitioning and ordering tasks that share some of their input data. We present a novel dynamic strategy based on data selection to efficiently allocate tasks to GPUs, together with a custom eviction policy, and compare them to existing strategies using either a well-known graph partitioner or standard scheduling techniques in runtime systems. We also improve an offline scheduler recently proposed for a single GPU by adding load balancing and task-stealing capabilities. All strategies have been implemented on top of the StarPU runtime, and we show that our dynamic strategy achieves better performance when scheduling tasks on multiple GPUs with limited memory.
Citations: 3
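As a minimal sketch of data-aware task selection (not StarPU's or the paper's actual policy), the snippet below greedily picks, among ready tasks, the one whose input blocks are already resident on the GPU; the task and data structures are invented for illustration, and a real scheduler would also handle eviction, multi-GPU balancing, and task stealing.

```cpp
#include <cstddef>
#include <iostream>
#include <set>
#include <vector>

// Toy data-aware selection: prefer the ready task with the most inputs
// already resident on the GPU, minimizing bytes to transfer over the bus.
struct Task { int id; std::vector<int> inputs; };  // inputs: data-block ids

int pickNext(const std::vector<Task>& ready, const std::set<int>& onGpu) {
    int best = -1;
    std::size_t bestHits = 0;
    for (std::size_t i = 0; i < ready.size(); ++i) {
        std::size_t hits = 0;
        for (int d : ready[i].inputs) hits += onGpu.count(d);
        if (best < 0 || hits > bestHits) {
            best = static_cast<int>(i);
            bestHits = hits;
        }
    }
    return best;
}

int main() {
    std::set<int> onGpu = {1, 2, 3};  // blocks currently in GPU memory
    std::vector<Task> ready = {{0, {7, 8}}, {1, {1, 2, 9}}, {2, {3, 7}}};
    std::cout << "next task: " << ready[pickNext(ready, onGpu)].id << '\n';  // task 1
}
```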
Fast Convergence to Fairness for Reduced Long Flow Tail Latency in Datacenter Networks
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00102
John Snyder, A. Lebeck
Abstract: Many data-intensive applications, such as distributed deep learning and data analytics, require moving vast amounts of data between compute servers in a distributed system. To meet the demands of these applications, datacenters are adopting Remote Direct Memory Access (RDMA), which has higher bandwidth and lower latency than traditional kernel-based networking. To ensure high performance of RDMA networks, congestion control manages queue depth on switches and has historically focused on moderating queue depth so that small flows complete quickly. Unfortunately, a side effect of many common design decisions is that large flows are starved of bandwidth. This negatively impacts the flow completion time (FCT) of large, bandwidth-bound flows, which are integral to the performance of data-intensive applications. The FCT is particularly impacted at the tail, which is increasingly critical for predictable application performance. We identify the root causes of the poor performance for long flows and measure the impact. We then design mechanisms that improve long-flow FCT without compromising small-flow performance. Our evaluations show that these improvements reduce the 99.9th-percentile tail FCT of long flows by over 2x.
Citations: 2
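To see why slow convergence to fairness stretches a long flow's completion time, here is a generic AIMD (additive-increase, multiplicative-decrease) toy simulation, unrelated to the paper's specific mechanisms: a late-starting flow only gradually claims its fair share from an established flow, and the longer that takes, the longer the established flow is over-throttled afterward.

```cpp
#include <cstdio>

// Minimal AIMD fairness illustration: two flows share a 100 Gb/s link.
// Additive increase keeps the rate gap constant; each multiplicative
// decrease halves it, so the rates converge to the fair share (50/50)
// only over many RTTs. All parameters are illustrative.
int main() {
    double link = 100.0, r1 = 90.0, r2 = 10.0;  // flow 2 just started
    const double add = 1.0, mult = 0.5;
    for (int rtt = 0; rtt < 30; ++rtt) {
        r1 += add; r2 += add;                            // additive increase
        if (r1 + r2 > link) { r1 *= mult; r2 *= mult; }  // congestion: back off
        if (rtt % 5 == 0)
            std::printf("rtt %2d: r1 = %5.1f, r2 = %5.1f\n", rtt, r1, r2);
    }
}
```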
I/O-Optimal Cache-Oblivious Sparse Matrix-Sparse Matrix Multiplication
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00013
Niels Gleinig, Maciej Besta, T. Hoefler
Abstract: Data movements between different levels of the memory hierarchy (I/O transitions, or simply I/Os) are a critical performance bottleneck in modern computing. It is therefore a problem of high practical relevance to find algorithms that use a minimal number of I/Os. We present a cache-oblivious sparse matrix-sparse matrix multiplication algorithm whose worst-case number of I/Os matches a previously established lower bound for this problem: O(N²/(B·M)) read I/Os and O(N²/B) write I/Os, where N is the size of the problem instance, M is the size of the fast memory, and B is the size of the cache lines. When the output does not need to be stored, the number of write I/Os can also be reduced to O(N²/(B·M)). This improves the worst-case I/O complexity of the previously best known algorithm for this problem (which is cache-aware) by a logarithmic multiplicative factor. Compared to other cache-oblivious algorithms, our algorithm improves the worst-case number of I/Os by a multiplicative factor of Θ(M·N). We show how the algorithm can be applied to produce the first I/O-efficient solution for the sparse 2-vs-3 diameter problem on sparse directed graphs.
Citations: 4
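Taking the two stated bounds as tight (an assumption; the abstract gives them as worst-case bounds matching a lower bound), a one-line calculation shows how strongly write I/Os dominate when the output must be stored:

```latex
% Ratio of write- to read-I/Os with the output stored, from the bounds in
% the abstract (N: instance size, M: fast-memory size, B: cache-line size).
\[
  \text{reads} = O\!\Big(\frac{N^2}{B\,M}\Big), \qquad
  \text{writes} = O\!\Big(\frac{N^2}{B}\Big), \qquad
  \frac{\text{writes}}{\text{reads}} = \Theta(M).
\]
% E.g. for M = 2^{20} words, writes outnumber reads by roughly a factor of 10^6,
% which is why dropping the stored output lets writes fall to O(N^2/(B M)).
```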
Modeling Matrix Engines for Portability and Performance
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00117
Nicholai Tukanov, Rajalakshmi Srinivasaraghavan, J. Moreira, Tze Meng Low
Abstract: Matrix engines, also known as matrix-multiplication accelerators, capable of computing on 2D matrices of various data types, were traditionally found only on GPUs. However, they are increasingly being introduced into CPU architectures to support AI/ML computations. Unlike traditional SIMD functional units, these accelerators require both the input and output data to be packed into a specific 2D data layout that often depends on the input and output data types. Due to the large variety of supported data types and architectures, a common abstraction is required to unify these seemingly disparate accelerators and more efficiently produce high-performance code. In this paper, we show that the hardware characteristics of a vast array of different matrix engines can be unified using a single analytical model that casts matrix engines as an accumulation of multiple outer products (also known as rank-k updates). This allows us to easily and quickly develop high-performance kernels using matrix engines for different architectures. We demonstrate our matrix engine model and its portability by applying it to two distinct architectures. Using our model, we show that the high-performance computational kernels and packing routines required for dense linear algebra libraries can be easily designed. Furthermore, we show that the performance attained by our implementations is around 90–99% (80–95% on large problems) of the theoretical peak throughput of the matrix engines.
Citations: 3
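The outer-product view is easy to make concrete: a k-deep matrix multiplication is exactly an accumulation of k rank-1 updates, as in this self-contained C++ sketch. Tile sizes and the engine-specific 2D packing layout are what the paper's model abstracts over; none of that is shown here.

```cpp
#include <cstdio>
#include <vector>

// A matrix engine modeled as an accumulation of outer products: the k-deep
// product C = A*B is computed as k rank-1 updates C += A(:,p) * B(p,:).
int main() {
    const int m = 2, n = 2, k = 3;
    std::vector<double> A = {1, 2, 3, 4, 5, 6};    // m x k, row-major
    std::vector<double> B = {7, 8, 9, 10, 11, 12}; // k x n, row-major
    std::vector<double> C(m * n, 0.0);

    for (int p = 0; p < k; ++p)            // one rank-1 update per step
        for (int i = 0; i < m; ++i)
            for (int j = 0; j < n; ++j)
                C[i * n + j] += A[i * k + p] * B[p * n + j];

    for (int i = 0; i < m; ++i)            // expect: 58 64 / 139 154
        std::printf("%6.1f %6.1f\n", C[i * n + 0], C[i * n + 1]);
}
```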
The Fast and Scalable MPI Application Launch of the Tianhe HPC system
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00129
Yiqin Dai, Yong Dong, Min Xie, Kai Lu, Ruibo Wang, Mingtian Shao, Juan Chen
Abstract: Fast and scalable MPI application launch helps achieve exascale performance and is becoming a common goal in high-performance computing. However, the traditional launch technique suffers from scalability deficiencies in its global information exchange and global barrier operation. This drawback makes it challenging to launch MPI applications quickly on large-scale systems. In this paper, we propose a fast and scalable application launch technique and detail its associated hardware and software support. The optimized launch technique includes a locality-aware static address generation rule that eliminates the need for address exchange, and a topology-aware global communication scheme that improves global communication efficiency. We also propose an optimized application launch sequence to support the above technique. We implement and evaluate the proposed launch technique on the Tianhe-2A supercomputer and the Tianhe Exascale Prototype Upgrade System. Experimental results show that our technique can reduce the launch time by 26.1% when launching an application with 256K processes.
Citations: 0
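A minimal sketch of the static-address-generation idea, assuming an invented encoding (node = rank / processes-per-node, consecutive ports): if every process can compute any peer's endpoint from the job layout alone, the all-to-all address exchange at launch time disappears. Tianhe's actual rule is hardware-specific and not given in the abstract.

```cpp
#include <cstdint>
#include <cstdio>

// Illustrative static address generation: a peer's network endpoint is a
// pure function of its rank and the job layout, so no exchange is needed.
// The encoding below is invented for illustration only.
struct Endpoint { uint32_t node_id; uint16_t port; };

Endpoint endpointOf(int rank, int procs_per_node, uint16_t base_port) {
    Endpoint e;
    e.node_id = static_cast<uint32_t>(rank / procs_per_node);
    e.port = static_cast<uint16_t>(base_port + rank % procs_per_node);
    return e;
}

int main() {
    for (int rank : {0, 5, 262143}) {  // e.g. within a 256K-process job
        Endpoint e = endpointOf(rank, 4, 9000);
        std::printf("rank %6d -> node %5u, port %u\n", rank, e.node_id, e.port);
    }
}
```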
An End-to-end and Adaptive I/O Optimization Tool for Modern HPC Storage Systems
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00128
Bin Yang, Yanliang Zou, Weiguo Liu, Wei Xue
Abstract: Real-world large-scale applications place ever more pressure on the storage services of modern supercomputers. Supercomputers have been introducing new storage devices and technologies to meet the performance requirements of various applications, leading to more complicated architectures. The high I/O demand of applications and the complicated, shared storage architectures make issues such as unbalanced load, I/O interference, system-parameter misconfiguration, and node performance degradation increasingly common. It is challenging both to achieve high application-level I/O performance and to efficiently utilize scarce storage resources. We propose AIOT, an end-to-end and adaptive I/O optimization tool for HPC storage systems, which introduces effective I/O performance modeling and several active tuning strategies to improve both the I/O performance of applications and the utilization of storage resources. AIOT provides a global view of the whole storage system and searches for the optimal end-to-end I/O path through flow-network modeling. Moreover, AIOT tunes system parameters across multiple layers of the storage system, using automatically identified application I/O behaviors and the instantaneous workload status of the storage system. We verified the effectiveness of AIOT for balancing I/O load, resolving I/O interference, improving I/O performance by configuring appropriate system parameters, and avoiding I/O performance degradation caused by abnormal nodes, through a number of real-world cases. AIOT has helped save over ten million core-hours since its deployment on Sunway TaihuLight in July 2021. It is also capable of managing other I/O optimization methods across various storage platforms.
Citations: 4
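As a loose, toy stand-in for AIOT's path search (the real tool models the whole compute-to-forwarding-to-storage graph as a flow network; everything below is invented), the snippet merely routes a new job's I/O through the least-loaded forwarding node, conveying only the flavor of load-aware end-to-end path selection.

```cpp
#include <cstdio>
#include <vector>

// Toy load-aware routing: pick the forwarding node with the lowest current
// utilization. AIOT's actual flow-network search optimizes the full path
// and also retunes parameters along it; this is only an illustration.
int main() {
    std::vector<double> fwd_load = {0.82, 0.35, 0.60, 0.91};  // utilization
    int best = 0;
    for (int i = 1; i < static_cast<int>(fwd_load.size()); ++i)
        if (fwd_load[i] < fwd_load[best]) best = i;
    std::printf("route I/O via forwarding node %d (load %.2f)\n",
                best, fwd_load[best]);
}
```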
SPIDER: An Effective, Efficient and Robust Load Scheduler for Real-time Split Frame Rendering
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00071
Bin Ma, Ziqiang Zhang, Yusen Li, Wentong Cai, Gang Wang, Xiaoguang Liu
Abstract: Interactive graphics applications are generally latency-critical, and using multiple GPUs to accelerate such applications has recently become possible with support from both hardware and software. Split frame rendering (SFR) is a popular approach to multi-GPU rendering: it splits a frame into disjoint regions and assigns the regions to different GPUs. Load scheduling for SFR is a crucial but challenging issue for achieving maximum real-time rendering performance, and it is not well addressed by existing solutions. In this paper, we propose SPIDER, a load scheduler that leverages a fuzzy PID (proportional-integral-derivative) controller to schedule the rendering workload among GPUs. SPIDER has several distinguishing properties: it is a feedback-based mechanism that does not need full knowledge of the dynamic system; it is computationally efficient and easy to implement; and it is highly robust to dynamic workload changes. Extensive experiments show that SPIDER achieves near-optimal performance for various workload patterns, significantly outperforming state-of-the-art baselines.
Citations: 1
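A plain (non-fuzzy) PID sketch conveys the feedback mechanism: the controller drives the frame-split ratio until both GPUs finish their share of a frame at the same time. The gains and GPU speeds below are invented for illustration; SPIDER additionally adapts the gains online with fuzzy logic.

```cpp
#include <cstdio>

// Plain-PID SFR load balancing between two GPUs: the error is the gap in
// per-frame render times, and the controller nudges the split ratio until
// the gap vanishes. With GPU 1 twice as fast, the ideal split is 1/3.
int main() {
    double split = 0.5;                // fraction of the frame on GPU 0
    double kp = 0.10, ki = 0.02, kd = 0.05, integral = 0, prev_err = 0;
    double speed0 = 1.0, speed1 = 2.0; // pretend GPU 1 is 2x faster

    for (int frame = 0; frame < 20; ++frame) {
        double t0 = split / speed0, t1 = (1 - split) / speed1;  // frame times
        double err = t0 - t1;          // > 0 means GPU 0 is overloaded
        integral += err;
        split -= kp * err + ki * integral + kd * (err - prev_err);
        prev_err = err;
        if (frame % 5 == 4)
            std::printf("frame %2d: split %.3f (ideal 0.333)\n", frame, split);
    }
}
```

The printed split oscillates toward 1/3, the classic underdamped PID trajectory; tuning the gains per workload is exactly what motivates the fuzzy variant.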
Falcon: A Timestamp-based Protocol to Maximize the Cache Efficiency in the Distributed Shared Memory
2022 IEEE International Parallel and Distributed Processing Symposium (IPDPS) Pub Date: 2022-05-01 DOI: 10.1109/ipdps53621.2022.00099
Jin Zhang, Xiangyao Yu, Zhengwei Qi, Haibing Guan
Abstract: Distributed shared memory (DSM) systems can handle data-intensive applications and have recently received more attention. A majority of existing DSM implementations are based on write-invalidation (WI) protocols, which achieve sub-optimal performance when the cache size is small. Specifically, the vast majority of invalidation messages become useless when evictions are frequent. This problem is troublesome given the scarce memory resources in data centers. To this end, we propose Falcon, a self-invalidation protocol that eliminates invalidation messages. It relies on per-operation timestamps to achieve the global memory order required by sequential consistency (SC). Furthermore, we provide a comprehensive discussion of the two protocols with an emphasis on the impact of cache size. We implement both protocols atop a recent DSM system, Grappa. The evaluation shows that the optimal protocol improves the performance of a KV database by 27% and a graph processing application by 71.4% over the vanilla cache-free scheme.
Citations: 1
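A simplified sketch of self-invalidation by timestamps, using a lease-style model invented for illustration: cached copies expire on their own clock, so writers never send invalidation messages. Falcon's actual per-operation timestamp rules, which enforce sequential consistency, are richer than this.

```cpp
#include <cstdio>
#include <unordered_map>

// Self-invalidation, simplified: each cached copy carries a lease; a read
// uses the copy only while the lease covers the current logical time and
// otherwise silently refetches from the home node. Writes update the home
// copy only -- no invalidation traffic, unlike write-invalidation (WI).
struct Copy { int value; long lease_until; };

std::unordered_map<int, int> owner;   // "home node" storage
std::unordered_map<int, Copy> cache;  // local cached copies
long now = 0;
const long kLease = 5;

int readAddr(int addr) {
    auto it = cache.find(addr);
    if (it != cache.end() && it->second.lease_until >= now)
        return it->second.value;                  // hit: lease still valid
    int v = owner.count(addr) ? owner[addr] : 0;  // self-invalidate + refetch
    cache[addr] = {v, now + kLease};
    return v;
}

void writeAddr(int addr, int v) { owner[addr] = v; }  // no invalidations sent

int main() {
    writeAddr(42, 1);
    std::printf("t=%ld read: %d\n", now, readAddr(42));  // fetch, lease to t=5
    writeAddr(42, 2);                                    // remote write
    now = 3; std::printf("t=%ld read: %d (stale but leased)\n", now, readAddr(42));
    now = 6; std::printf("t=%ld read: %d (lease expired, refetched)\n", now, readAddr(42));
}
```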