{"title":"Informed Prefetching in I/O Bounded Distributed Deep Learning","authors":"X. Ruan, Haiquan Chen","doi":"10.1109/IPDPSW52791.2021.00127","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00127","url":null,"abstract":"Deep learning research has been growing rapidly in the past decade for the significant performance improvement on GPUs. While the computing capability of current GPUs is tremendous, data pre-processing/loading becomes a potential bottleneck that incurs major training latency and adds overhead in both CPU and memory, especially when datasets are too large to fit in memory. When datasets are stripped on distributed file systems, access to a remote storage node may deteriorate I/O performance significantly due to network I/O latency in cloud. Moreover, some deep learning workloads may be assigned to remote GPU servers in Edge Computing which results in even higher network I/O latency. Therefore, it is desirable to provide efficient parallel and distributed prefetching solution which is able to reduce the I/O cost of data pre-processing before feeding the data into GPUs for training on distributed storage systems of Cloud or Edge. Although the current deep learning frameworks like PyTorch or TensorFlow offer multiprocessing data loading functionalities, their approaches come at the price of high computing resource usage and memory usage. In this paper, we presented a novel thread-level Informed Prefetching Data Loader framework, IPDL, in support of efficient data prefetching from remote storage nodes in distributed deep learning environments and possibly in Edge Computing. Compared to its counterparts in PyTorch, IPDL is able to provide accelerated I/O performance for data loading while consuming lower computing resource and memory space at the same time. Extensive experiments on both an individual server and a cluster computing system have shown the superiority of IPDL over the latest implementation of PyTorch.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129727513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CPRIC: Collaborative Parallelism for Randomized Incremental Constructions","authors":"Florian Fey, S. Gorlatch","doi":"10.1109/IPDPSW52791.2021.00081","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00081","url":null,"abstract":"Randomized algorithms often outperform their deterministic counterparts in terms of simplicity and efficiency. In this paper, we consider Randomized Incremental Constructions (RICs) that are very popular, in particular in combinatorial optimization and computational geometry. Our contribution is Collaborative Parallel RIC (CPRIC) –a novel approach to parallelizing RIC for modern parallel architectures like vector processors and GPUs. We show that our approach based on a work-stealing mechanism avoids the control-flow divergence of parallel threads, thus improving the performance of parallel implementation. Our extensive experiments on CPU and GPU demonstrate the advantages of our CPRIC approach that achieves an average speedup between 4× and 5× compared to the naively parallelized RIC.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127610628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Measuring Cache Complexity Using Data Movement Distance (DMD)","authors":"Donovan Snyder, C. Ding","doi":"10.1109/IPDPSW52791.2021.00070","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00070","url":null,"abstract":"Given the ubiquity of cache-based machines, it is important to analyze how well a program or an algorithm uses the cache. There is no widely accepted measure of cache complexity, yet the cache complexity is often more important to performance than the measures of time and space complexity. This paper presents Data Movement Distance (DMD) to measure the cost of cache complexity for algorithms, demonstrates its use, and discusses it as a measure of locality. Since processor speeds are getting ever faster, one of the main bottlenecks in modern computing is moving the needed data into and around the processor. DMD measures the efficiency of the algorithm in this sense and therefore may be a much-needed complement to the conventional analysis of computation complexity. In this paper, we give an overview of DMD and some basic results. These will be expanded upon in future work.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124315761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Message from the EduPar-21 Workshop Chair","authors":"","doi":"10.1109/ipdpsw52791.2021.00055","DOIUrl":"https://doi.org/10.1109/ipdpsw52791.2021.00055","url":null,"abstract":"","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123912611","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Facilitating Staging-based Unstructured Mesh Processing to Support Hybrid In-Situ Workflows","authors":"Zhe Wang, P. Subedi, Matthieu Dorier, Philip E. Davis, M. Parashar","doi":"10.1109/IPDPSW52791.2021.00152","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00152","url":null,"abstract":"In-situ and in-transit processing alleviate the gap between the computing and I/O capabilities by scheduling data analytics close to the data source. Hybrid in-situ processing splits data analytics into two stages: the data processing that runs in-situ aims to extract regions of interest, which are then transferred to staging services for further in-transit analytics. To facilitate this type of hybrid in-situ processing, the data staging service needs to support complex intermediate data representations generated/consumed by the in-situ tasks. Unstructured (or irregular) mesh is one such derived data representation that is typically used and bridges simulation data and analytics. However, how staging services efficiently support unstructured mesh transfer and processing remains to be explored. This paper investigates design options for transferring and processing unstructured mesh data using staging services. Using polygonal mesh data as an example, we show that hybrid in-situ workflows with staging-based unstructured mesh processing can effectively support hybrid in-situ workflows, and can significantly decrease data movement overheads.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"60 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131107660","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autonomous Load Balancing in Distributed Hash Tables Using Churn and the Sybil Attack","authors":"Andrew Rosen, Benjamin Levin, A. Bourgeois","doi":"10.1109/IPDPSW52791.2021.00097","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00097","url":null,"abstract":"Distributed Hash Tables (DHTs) are an integral foundation for a variety of modern internet applications. In previous work, we have shown that DHTs can also be used as a means of organizing a large number of workers to tackle large-scale computing problems in a fault tolerant context. Whether a DHT is being used for file access or distributing a large-scale computing job, a cryptographic hash function is used to assign keys for nodes and data. Ideally, these would be uniformly distributed across the available range, thus evenly distributing the nodes and tasks. However, this is rarely the case in practice and as a result, the workload can become highly unbalanced. To address this issue, there have been numerous methods proposed for load balancing DHTs, but often they are a centralized approach.In this paper, we present four methods to autonomously balance the load of DHTs: 1) induced churn; 2) random injection of Sybil Nodes; 3) neighbor injection; and 4) invitation of nodes with low workloads. Each approach is completely decentralized, requiring minimal overhead, with individual nodes making decisions based only upon local information. What makes our approach unique is that the strategies rely on using the inherent churn in a DHT or by a variation of the Sybil attack to balance the workload. We simulate the four strategies on a Chord DHT and show they significantly rebalance the workload in a DHT. The strategy of randomly injecting virtual \"Sybil\" nodes performed the best in terms of balance and speedup.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131385506","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Selection of Tensor Decomposition for Compressing Convolutional Neural Networks A Case Study on VGG-type Networks","authors":"Chia-Chun Liang, Che-Rung Lee","doi":"10.1109/IPDPSW52791.2021.00115","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00115","url":null,"abstract":"Tensor decomposition is one of the model reduction techniques for compressing deep neural networks. Existing methods use either Tucker decomposition (TD) or Canonical Polyadic decomposition (CPD) for model compression, but none of them tried to combine those two methods, owing to the complexity of choosing a proper decomposition method for each layer. In this paper, we adopted the automatic tuning technique to design an algorithm that can mix both tensor decomposition methods, called Mixed Tensor Decomposition (MTD). The goal is to achieve better compression ratio while keeping similar accuracy as the original models. We used VGG type networks for the case study since they are relatively heavy and computationally expensive. We first studied the relation of model accuracy and compression ratio for Tucker and CPD applying to convolution neural networks (CNN). Based on the studied results, we designed a strategy to select the most suitable decomposition method for each layer, and further fine-tunes the models to recover the accuracy. We have conducted experiments using VGG11 and VGG16 with CIFAR10 dataset, and compared MTD with other tensor decomposition algorithms. The results show that MTD can achieve compression ratio 32 × and 37 × for VGG11 and VGG16 respectively with less than 1% accuracy drops, which is much better than the state-of-the-art tensor decomposition algorithms for model compression.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126945283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Evaluating I/O Acceleration Mechanisms of SX-Aurora TSUBASA","authors":"Y. Sasaki, Ayumu Ishizuka, Mulya Agung, H. Takizawa","doi":"10.1109/IPDPSW52791.2021.00113","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00113","url":null,"abstract":"In a heterogeneous computing system, different kinds of processors might need to be involved in the execution of a file I/O operation. Since NEC SX-Aurora TSUBASA is one such system, two I/O acceleration mechanisms are offered to reduce the data transfer overheads among the processors for a file I/O operation. This paper first investigates the effects of the two mechanisms on the I/O performance of SX-Aurora TSUBASA. Considering the results, proper use of the two mechanisms is discussed via a real-world application of flood damage estimation. These results clearly demonstrate the demand for auto-tuning, i.e., adaptively selecting either of the two mechanisms with considering application behaviors and system configuration.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114573844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable and Highly Available Multi-Objective Neural Architecture Search in Bare Metal Kubernetes Cluster","authors":"Andreas Klos, Marius Rosenbaum, W. Schiffmann","doi":"10.1109/IPDPSW52791.2021.00094","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00094","url":null,"abstract":"The interest in deep neural networks for solving computer vision task has dramatically increased. Due to the heavy influence of the neural networks architecture on its predictive accuracy, neural architecture search has gained much attention in recent years. This research area typically implies a high computational burden and thus, requires high scalability as well as availability to ensure no data loss or waist of computational power. Moreover, the thinking of developing applications has changed from monolithic once to microservices. Hence, we developed a highly scalable and available multi-objective neural architecture search and adopted to the modern thinking of developing application by subdividing an already existing, monolithic neural architecture search – based on a genetic algorithm – into microservices. Furthermore, we adopted the initial population creation by 1,000 mutations of each individual, extended the approach by inception layers, implemented it as island model to facilitate scalability and achieved on MNIST, Fashion-MNIST and CIFAR-10 dataset 99.75%, 94.35% and 89.90% test accuracy respectively. Besides, our model is strongly focused on high availability empowered by the deployment in our bare-metal Kubernetes cluster. Our results show that the introduced multi-objective neural architecture search can easily handle even the loss of nodes and proceed the algorithm within seconds on another node without any loss of results or the necessity of human interaction.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122159980","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallelization of GKV benchmark using OpenACC","authors":"Makoto Morishita, S. Ohshima, T. Katagiri, Toru Nagai","doi":"10.1109/IPDPSW52791.2021.00109","DOIUrl":"https://doi.org/10.1109/IPDPSW52791.2021.00109","url":null,"abstract":"The computing power of the Graphics Processing Unit (GPU) has received great attention in recent years, as 140 supercomputers with NVIDIA GPUs were ranked in the TOP500 for November 2020 [1]. However, CUDA, which is widely used in GPU programming, needs to be written at a low level and often requires the specialized knowledge of the GPU memory hierarchy and execution models. In this study, we used OpenACC [2], which semi-automatically generates kernel code by inserting directives into a program to speed up the application. The target application was benchmark program based on the plasma turbulence analysis code, gyrokinetic Vlasov code (GKV). With our implementation of OpenACC, kernel2, kernel3, and kernel4 of the benchmark were 31.43, 7.08, and 10.74 times faster, respectively, compared to CPU sequential execution. Thus, we succeeded in increasing the applications’ speed. In the future, we will port the rest of the code to the GPU environment to run the entire GKV on GPUs.","PeriodicalId":170832,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115408498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}