{"title":"Rethinking Support for Region Conflict Exceptions","authors":"Swarnendu Biswas, Rui Zhang, Michael D. Bond, Brandon Lucia","doi":"10.1109/IPDPS.2019.00116","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00116","url":null,"abstract":"Current shared-memory systems provide well-defined execution semantics only for data-race-free executions. A state-of-the-art technique called Conflict Exceptions (CE) extends M(O)ESI-based coherence to provide defined semantics to all program executions. However, CE incurs significant performance costs because of its need to frequently access metadata in memory. In this work, we explore designs for practical architecture support for region conflict exceptions. First, we propose an on-chip metadata cache called access information memory (AIM) to reduce memory accesses in CE. The extended design is called CE+. In spite of the AIM, CE+ stresses or saturates the on-chip interconnect and the off-chip memory network bandwidth because of its reliance on eager write-invalidation-based coherence. We explore whether detecting conflicts is potentially better suited to cache coherence based on release consistency and self-invalidation, rather than M(O)ESI-based coherence. We realize this insight in a novel architecture design called ARC. Our evaluation shows that CE+ improves the run-time performance and energy usage over CE for several applications across different core counts, but can suffer performance penalties from network saturation. ARC generally outperforms CE, and is competitive with CE+ on average while stressing the on-chip interconnect and off-chip memory network much less, showing that coherence based on release consistency and self-invalidation is well suited to detecting region conflicts.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128661858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MULTISKIPGRAPH: A Self-Stabilizing Overlay Network that Maintains Monotonic Searchability","authors":"Linghui Luo, C. Scheideler, Thim Strothmann","doi":"10.1109/IPDPS.2019.00093","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00093","url":null,"abstract":"Self-stabilizing overlay networks have the advantage of being able to recover from illegal states and faults. However, the majority of these networks cannot give any guarantees on their functionality while the recovery process is going on. We are especially interested in searchability, i.e., the functionality that search messages for a specific node are answered successfully if a node exists in the network. In this paper we investigate overlay networks that ensure the maintenance of monotonic searchability while the self-stabilization is going on. More precisely, once a search message from node u to another node v is successfully delivered, all future search messages from u to v succeed as well. We extend the existing research by focusing on skip graphs and present a solution for two scenarios: (i) the goal topology is a super graph of the perfect skip graph and (ii) the goal topology is exactly the perfect skip graph.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129175069","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"D3: Deterministic Data Distribution for Efficient Data Reconstruction in Erasure-Coded Distributed Storage Systems","authors":"Zhipeng Li, Min Lv, Yinlong Xu, Yongkun Li, Liangliang Xu","doi":"10.1109/IPDPS.2019.00064","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00064","url":null,"abstract":"Due to individual unreliable commodity components, failures are common in large-scale distributed storage systems. Erasure codes are widely deployed in practical storage systems to provide fault tolerance with low storage overhead. However, the commonly used random data placement in storage systems based on erasure codes induces heavy cross-rack traffic, load imbalance, and random access, which slow down the recovery process upon failures. In this paper, with orthogonal arrays, we define a Deterministic Data Distribution (D^3) of blocks to nodes and racks, and propose an efficient failure recovery approach based on D^3. D^3 not only uniformly distributes data/parity blocks among storage servers, but also balances the repair traffic among racks and storage servers for failure recovery. Furthermore, D^3 also minimizes the cross-rack repair traffic for data layouts against a single rack failure and provides sequential access for failure recovery. We implement D^3 in Hadoop Distributed File System (HDFS) with a cluster of 28 machines. Our experiments show that D^3 significantly speeds up the failure recovery process compared with random data distribution, e.g., 2.21 times for (6, 3)-RS code in a system consisting of eight racks and three nodes in each rack.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129873918","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Load-Balanced Sparse MTTKRP on GPUs","authors":"Israt Nisa, Jiajia Li, Aravind Sukumaran-Rajam, R. Vuduc, P. Sadayappan","doi":"10.1109/IPDPS.2019.00023","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00023","url":null,"abstract":"Sparse matricized tensor times Khatri-Rao product (MTTKRP) is one of the most computationally expensive kernels in sparse tensor computations. This work focuses on optimizing the MTTKRP operation on GPUs, addressing both performance and storage requirements. We begin by identifying the performance bottlenecks in directly extending the state-of-the-art CSF (compressed sparse fiber) format from CPUs to GPUs. A significant challenge with GPUs compared to multicore CPUs is that of utilizing the much greater degree of parallelism in a load-balanced fashion for irregular computations like sparse MTTKRP. To address this issue, we develop a new storage-efficient representation for tensors that enables high-performance, load-balanced execution of MTTKRP on GPUs. A GPU implementation of sparse MTTKRP using the new sparse tensor representation is shown to outperform all currently known parallel sparse CPU and GPU MTTKRP implementations.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129269260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving Strong-Scaling of CNN Training by Exploiting Finer-Grained Parallelism","authors":"Nikoli Dryden, N. Maruyama, Tom Benson, Tim Moon, M. Snir, B. V. Essen","doi":"10.1109/IPDPS.2019.00031","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00031","url":null,"abstract":"Scaling CNN training is necessary to keep up with growing datasets and reduce training time. We also see an emerging need to handle datasets with very large samples, where memory requirements for training are large. Existing training frameworks use a data-parallel approach that partitions samples within a mini-batch, but limits on scaling the mini-batch size and on memory consumption make this untenable for large samples. We describe and implement new approaches to convolution, which parallelize using spatial decomposition or a combination of sample and spatial decomposition. This introduces many performance knobs for a network, so we develop a performance model for CNNs and present a method for using it to automatically determine efficient parallelization strategies. We evaluate our algorithms with microbenchmarks and image classification with ResNet-50. Our algorithms allow us to prototype a model for a mesh-tangling dataset, where sample sizes are very large. We show that our parallelization achieves excellent strong and weak scaling and enables training for previously unreachable datasets.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130279763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning","authors":"Tal Ben-Nun, Maciej Besta, Simon Huber, A. Ziogas, D. Peter, T. Hoefler","doi":"10.1109/IPDPS.2019.00018","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00018","url":null,"abstract":"We introduce Deep500: the first customizable benchmarking infrastructure that enables fair comparison of the plethora of deep learning frameworks, algorithms, libraries, and techniques. The key idea behind Deep500 is its modular design, where deep learning is factorized into four distinct levels: operators, network processing, training, and distributed training. Our evaluation illustrates that Deep500 is customizable (enables combining and benchmarking different deep learning codes) and fair (uses carefully selected metrics). Moreover, Deep500 is fast (incurs negligible overheads), verifiable (offers infrastructure to analyze correctness), and reproducible. Finally, as the first distributed and reproducible benchmarking system for deep learning, Deep500 provides software infrastructure to utilize the most powerful supercomputers for extreme-scale workloads.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125714605","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Robust Dynamic Resource Allocation via Probabilistic Task Pruning in Heterogeneous Computing Systems","authors":"James Gentry, Chavit Denninnart, M. Salehi","doi":"10.1109/IPDPS.2019.00047","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00047","url":null,"abstract":"In heterogeneous distributed computing (HC) systems, diversity can exist in both computational resources and arriving tasks. In an inconsistently heterogeneous computing system, task types have different execution times on heterogeneous machines. A method is required to map arriving tasks to machines based on machine availability and performance, maximizing the number of tasks meeting deadlines (defined as robustness). For tasks with hard deadlines (e.g., those in live video streaming), tasks that miss their deadlines are dropped. The problem investigated in this research is maximizing the robustness of an oversubscribed HC system. A way to maximize this robustness is to prune (i.e., defer or drop) tasks with low probability of meeting their deadlines to increase the probability of other tasks meeting their deadlines. In this paper, we first provide a mathematical model to estimate a task's probability of meeting its deadline in the presence of task dropping. We then investigate methods for engaging probabilistic dropping and we find thresholds for dropping and deferring. Next, we develop a pruning-aware mapping heuristic and extend it to engender fairness across various task types. We show the cost benefit of using probabilistic pruning in an HC system. Simulation results, harnessing a selection of mapping heuristics, show efficacy of the pruning mechanism in improving robustness (on average by around 25%) and cost in an oversubscribed HC system by up to around 40%.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"460 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124346947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SimFS: A Simulation Data Virtualizing File System Interface","authors":"S. D. Girolamo, Pirmin Schmid, T. Schulthess, T. Hoefler","doi":"10.1109/IPDPS.2019.00071","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00071","url":null,"abstract":"Nowadays, simulations can produce petabytes of data to be stored in parallel filesystems or large-scale databases. This data is accessed over the course of decades, often by thousands of analysts and scientists. However, storing these volumes of data for long periods of time is not cost effective and, in some cases, practically impossible. We propose to transparently virtualize the simulation data, relaxing the storage requirements by not storing the full output and re-simulating the missing data on demand. We develop SimFS, a file system interface that exposes a virtualized view of the simulation output to the analysis applications and manages the re-simulations. SimFS monitors the access patterns of the analysis applications in order to (1) decide which data to keep stored for faster accesses and (2) employ prefetching strategies to reduce the access time of missing data. Virtualizing simulation data allows us to trade storage for computation: this paradigm becomes similar to traditional on-disk analysis (all data is stored) or in situ (no data is stored) according to the storage resources that are assigned to SimFS. Overall, by exploiting the growing computing power and relaxing the storage capacity requirements, SimFS offers a viable path towards exa-scale simulations.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131117523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DLHub: Model and Data Serving for Science","authors":"Ryan Chard, Zhuozhao Li, K. Chard, Logan T. Ward, Y. Babuji, A. Woodard, S. Tuecke, B. Blaiszik, M. Franklin, Ian T Foster","doi":"10.1109/IPDPS.2019.00038","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00038","url":null,"abstract":"While the Machine Learning (ML) landscape is evolving rapidly, there has been a relative lag in the development of the \"learning systems\" needed to enable broad adoption. Furthermore, few such systems are designed to support the specialized requirements of scientific ML. Here we present the Data and Learning Hub for science (DLHub), a multi-tenant system that provides both model repository and serving capabilities with a focus on science applications. DLHub addresses two significant shortcomings in current systems. First, its self-service model repository allows users to share, publish, verify, reproduce, and reuse models, and addresses concerns related to model reproducibility by packaging and distributing models and all constituent components. Second, it implements scalable and low-latency serving capabilities that can leverage parallel and distributed computing resources to democratize access to published models through a simple web interface. Unlike other model serving frameworks, DLHub can store and serve any Python 3-compatible model or processing function, plus multiple-function pipelines. We show that relative to other model serving systems including TensorFlow Serving, SageMaker, and Clipper, DLHub provides greater capabilities, comparable performance without memoization and batching, and significantly better performance when the latter two techniques can be employed. We also describe early uses of DLHub for scientific applications.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114967162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MD-GAN: Multi-Discriminator Generative Adversarial Networks for Distributed Datasets","authors":"Corentin Hardy, E. L. Merrer, B. Sericola","doi":"10.1109/IPDPS.2019.00095","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00095","url":null,"abstract":"A recent technical breakthrough in the domain of machine learning is the discovery and the multiple applications of Generative Adversarial Networks (GANs). Those generative models are computationally demanding, as a GAN is composed of two deep neural networks, and because it trains on large datasets. A GAN is generally trained on a single server. In this paper, we address the problem of distributing GANs so that they are able to train over datasets that are spread across multiple workers. MD-GAN is presented as the first solution to this problem: we propose a novel learning procedure for GANs so that they fit this distributed setup. We then compare the performance of MD-GAN to a version of federated learning adapted to GANs, using the MNIST, CIFAR10 and CelebA datasets. MD-GAN exhibits a reduction by a factor of two of the learning complexity on each worker node, while providing performance better than or identical to that of the federated learning adaptation. We finally discuss the practical implications of distributing GANs.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-11-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114270725","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}