{"title":"Scalability Challenges of an Industrial Implicit Finite Element Code","authors":"François-Henry Rouet, C. Ashcraft, J. Dawson, R. Grimes, Erman Guleryuz, S. Koric, R. Lucas, J. Ong, T. Simons, Ting-Ting Zhu","doi":"10.1109/IPDPS47924.2020.00059","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00059","url":null,"abstract":"LS-DYNA is a well-known multiphysics code with both explicit and implicit time stepping capabilities. Implicit simulations rely heavily on sparse matrix computations, in particular direct solvers, and are notoriously much harder to scale than explicit simulations. In this paper, we investigate the scalability challenges of the implicit structural mode of LS-DYNA. In particular, we focus on linear constraint analysis, sparse matrix reordering, symbolic factorization, and numerical factorization. Our problem of choice for this study is a thermomechanical simulation of jet engine models built by Rolls-Royce with up to 200 million degrees of freedom, or equations. The models are used for engine performance analysis and design optimization, in particular optimization of tip clearances in the compressor and turbine sections of the engine. We present results using as many as 131,072 cores on the Blue Waters Cray XE6/XK7 supercomputer at NCSA and the Titan Cray XK7 supercomputer at OLCF. 
Since the main focus is on general linear algebra problems, this work is of interest to all linear algebra practitioners, not only developers of implicit finite element codes.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"27 1","pages":"505-514"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75476275","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
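The sparse-solver phases named in the abstract (reordering, symbolic factorization, numerical factorization) can be illustrated in miniature. The toy sketch below — plain Python, not LS-DYNA's solver — runs symbolic elimination on an "arrow" sparsity pattern and counts the fill-in produced by two elimination orders, which is why a fill-reducing reordering precedes the numerical factorization.

```python
def symbolic_fill(pattern, order):
    """Count fill-in edges created by symbolically eliminating the
    vertices of a symmetric sparsity pattern in the given order."""
    adj = {v: set(nbrs) for v, nbrs in pattern.items()}
    eliminated = set()
    fill = 0
    for v in order:
        nbrs = sorted(u for u in adj[v] if u not in eliminated)
        # Eliminating v couples all of its remaining neighbours.
        for i, a in enumerate(nbrs):
            for b in nbrs[i + 1:]:
                if b not in adj[a]:
                    adj[a].add(b)
                    adj[b].add(a)
                    fill += 1
        eliminated.add(v)
    return fill

# "Arrow" pattern: vertex 0 is coupled to every other vertex.
n = 8
pattern = {0: set(range(1, n)), **{i: {0} for i in range(1, n)}}

bad = symbolic_fill(pattern, list(range(n)))               # hub eliminated first
good = symbolic_fill(pattern, list(range(n - 1, -1, -1)))  # hub eliminated last
print(bad, good)  # hub-first fills every off-diagonal pair; hub-last adds no fill
```

Eliminating the dense row first fills in 21 new couplings on this 8-vertex pattern; eliminating it last adds none, the same effect (at vastly larger scale) that reordering heuristics pursue in production solvers.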
{"title":"PCGCN: Partition-Centric Processing for Accelerating Graph Convolutional Network","authors":"Chao Tian, Lingxiao Ma, Zhi Yang, Yafei Dai","doi":"10.1109/IPDPS47924.2020.00100","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00100","url":null,"abstract":"Inspired by the successes of convolutional neural networks (CNN) in computer vision, the convolutional operation has been moved beyond low-dimensional grids (e.g., images) to high-dimensional graph-structured data (e.g., web graphs, social networks), leading to the graph convolutional network (GCN). GCN has been gaining popularity due to its success in real-world applications such as recommendation, natural language processing, etc. Because neural networks and graph propagation have high computational complexity, GPUs have been introduced to both neural network training and graph processing. However, it is notoriously difficult to perform efficient GCN computing on data parallel hardware like GPUs due to the sparsity and irregularity in graphs. In this paper, we present PCGCN, a novel and general method to accelerate GCN computing by taking advantage of the locality in graphs. We experimentally demonstrate that real-world graphs usually have the clustering property that can be used to enhance the data locality in GCN computing. Building on this, PCGCN partitions the whole graph into chunks according to locality and processes subgraphs with a dual-mode computing strategy that includes selective and full processing methods for sparse and dense subgraphs, respectively. 
Compared to existing state-of-the-art implementations of GCN on real-world and synthetic datasets, our implementation on top of TensorFlow achieves up to 8.8× speedup over the fastest baseline.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"936-945"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75620806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
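PCGCN's dual-mode idea — a dense kernel for dense adjacency blocks, a nonzero-only kernel for sparse ones — can be sketched in plain Python. This is a didactic stand-in, not PCGCN's GPU implementation; the block size and density threshold below are arbitrary choices of ours.

```python
def spmm_dual_mode(adj, feats, block=2, dense_threshold=0.5):
    """Compute out = adj @ feats blockwise, choosing a 'full' (dense)
    kernel or a 'selective' (nonzero-only) kernel per block based on
    that block's density."""
    n, f = len(adj), len(feats[0])
    out = [[0.0] * f for _ in range(n)]
    for rb in range(0, n, block):
        for cb in range(0, n, block):
            rows = range(rb, min(rb + block, n))
            cols = range(cb, min(cb + block, n))
            nnz = sum(adj[i][j] != 0 for i in rows for j in cols)
            if nnz / (len(rows) * len(cols)) >= dense_threshold:
                # "Full" mode: iterate every entry of the dense block.
                for i in rows:
                    for j in cols:
                        for k in range(f):
                            out[i][k] += adj[i][j] * feats[j][k]
            else:
                # "Selective" mode: touch only the nonzeros.
                for i in rows:
                    for j in cols:
                        if adj[i][j] != 0:
                            for k in range(f):
                                out[i][k] += adj[i][j] * feats[j][k]
    return out
```

Both modes produce the same result; the payoff on real hardware comes from the dense kernel's regular memory access on dense blocks and the sparse kernel skipping zeros on sparse ones.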
{"title":"A Self-Optimized Generic Workload Prediction Framework for Cloud Computing","authors":"V. Jayakumar, Jaewoo Lee, I. Kim, Wei Wang","doi":"10.1109/IPDPS47924.2020.00085","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00085","url":null,"abstract":"The accurate prediction of the future workload, such as the job arrival rate and the user request rate, is critical to the efficiency of resource management and elasticity in the cloud. However, designing a generic workload predictor that works properly for various types of workloads is very challenging due to the large variety of workload patterns and the dynamic changes within a workload. Because of these challenges, existing workload predictors are usually hand-tuned for specific (types of) workloads for maximum accuracy. This necessity to individually tune the predictors also makes it very difficult to reproduce the results from prior research, as the predictor designs have a strong dependency on the workloads. In this paper, we present a novel generic workload prediction framework, LoadDynamics, that can provide high accuracy predictions for any workload. LoadDynamics employs Long Short-Term Memory (LSTM) models and can automatically optimize its internal parameters for an individual workload to achieve high prediction accuracy. We evaluated LoadDynamics with a mixture of workload traces representing public cloud applications, scientific applications, data center jobs and web applications. The evaluation results show that LoadDynamics has only 18% prediction error on average, which is at least 6.7% lower than state-of-the-art workload prediction techniques. The error of LoadDynamics was also only 1% higher than the best predictor found by exhaustive search for each workload. 
When applied in the Google Cloud, LoadDynamics-enabled auto-scaling policy also outperformed the state-of-the-art predictors by reducing the job turnaround time by at least 24.6% and reducing virtual machine over-provisioning by at least 4.8%.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"96 1","pages":"779-788"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"75852413","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
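LoadDynamics' self-optimization (an LSTM tuned per workload via Bayesian optimization) is beyond a few lines, but the core idea — automatically selecting predictor parameters that minimize historical one-step prediction error — can be sketched with a deliberately simple stand-in predictor. Exponential smoothing and the grid search below are our simplifications, not the paper's method.

```python
def one_step_error(trace, alpha):
    """Mean absolute one-step-ahead error of exponential smoothing
    with parameter alpha on a workload trace."""
    pred, err, count = trace[0], 0.0, 0
    for actual in trace[1:]:
        err += abs(actual - pred)
        count += 1
        pred = alpha * actual + (1 - alpha) * pred  # smoothing update
    return err / count

def auto_tune(trace, grid=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Pick the smoothing parameter minimizing historical error —
    a toy analogue of per-workload predictor optimization."""
    return min(grid, key=lambda a: one_step_error(trace, a))

# A workload trace with a level shift; the tuner picks whichever
# alpha tracks this particular pattern best.
trace = [100, 102, 98, 101, 180, 182, 179, 181]
best = auto_tune(trace)
print(best)
```

The same loop generalizes: swap the predictor for any model and the grid for a smarter search (such as the BO used by LoadDynamics) and the framework stays workload-agnostic.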
{"title":"Understanding the Interplay between Hardware Errors and User Job Characteristics on the Titan Supercomputer","authors":"Seung-Hwan Lim, Ross G. Miller, Sudharshan S. Vazhkudai","doi":"10.1109/IPDPS47924.2020.00028","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00028","url":null,"abstract":"Designing dependable supercomputers begins with an understanding of errors in real-world, large-scale systems. The Titan supercomputer at Oak Ridge National Laboratory provides a unique opportunity to investigate errors when an actual system is actively used by multiple concurrent users and workloads from diverse domains at varying scales. This study presents a thorough analysis of 6,908,497 hardware errors from 18,688 compute nodes of Titan for 312,215 user jobs over a 3-year time period. Through careful joining of two system logs – the Machine Check Architecture (MCA) log and the job scheduler log – we show the correlated pattern of hardware errors for each job and user, in addition to individual descriptive statistics of errors, jobs, and users. Since the majority of hardware errors are memory errors, this study also shows the importance of error correction in memory systems.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"478 1","pages":"180-190"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79946078","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
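The log join described above — attributing each MCA error to whichever job occupied that node at the error's timestamp — can be sketched as follows. The record schemas and field order are our assumptions, not the paper's actual log formats.

```python
def join_errors_to_jobs(errors, jobs):
    """errors: iterable of (node, timestamp).
    jobs: iterable of (job_id, node, start, end) scheduler records.
    Returns {job_id: error_count}, attributing each hardware error
    to the job running on that node at that time."""
    counts = {}
    for node, t in errors:
        for job_id, jnode, start, end in jobs:
            # half-open interval [start, end) avoids double-counting
            # an error at a job boundary
            if jnode == node and start <= t < end:
                counts[job_id] = counts.get(job_id, 0) + 1
    return counts

errors = [("n1", 5), ("n1", 15), ("n2", 7)]
jobs = [("j1", "n1", 0, 10), ("j2", "n1", 10, 20), ("j3", "n2", 0, 10)]
print(join_errors_to_jobs(errors, jobs))  # {'j1': 1, 'j2': 1, 'j3': 1}
```

At Titan's scale (millions of errors, hundreds of thousands of jobs), this nested scan would be replaced by an interval index or a sort-merge join keyed on node, but the attribution logic is the same.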
{"title":"DASSA: Parallel DAS Data Storage and Analysis for Subsurface Event Detection","authors":"Bin Dong, V. R. Tribaldos, Xin Xing, S. Byna, J. Ajo-Franklin, Kesheng Wu","doi":"10.1109/IPDPS47924.2020.00035","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00035","url":null,"abstract":"Recently developed distributed acoustic sensing (DAS) technologies convert fiber-optic cables into large arrays of subsurface sensors, enabling a variety of applications including earthquake detection and environmental characterization. However, DAS systems produce voluminous datasets sampled at high spatial-temporal resolution and consequently, discovering useful geophysical knowledge within these large-scale data becomes a nearly impossible task for geophysicists. It is appealing to use supercomputers for DAS data analysis, as modern supercomputers are capable of performing over a hundred quadrillion floating-point operations per second and have access to exabytes of storage space. Unfortunately, the majority of geophysical data processing libraries are not geared towards these supercomputer environments. This paper introduces a parallel DAS Data Storage and Analysis (DASSA) framework to enable easy-to-use and parallel DAS data analysis on modern supercomputers. DASSA uses a hybrid (i.e., MPI and OpenMP) data analysis execution engine that supports a user-defined function (UDF) interface for various operations and automatically parallelizes them for supercomputer execution. DASSA also provides novel data storage and access strategies, such as communication-avoiding parallel I/O, to reduce the cost of retrieving large DAS data for analysis. 
Compared with existing data analysis pipelines used by the geophysical community, DASSA is 16× faster and can efficiently scale up to 1,456 computing nodes with 11,648 CPU cores.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"6 1","pages":"254-263"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"73976374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
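One common way to avoid inter-process communication in chunked UDF processing is to read each chunk with a halo of overlapping samples, so a sliding-window operator never needs data held by a neighboring chunk. The abstract does not say this is exactly DASSA's strategy, so treat the sketch below (1-D, sequential, names ours) as an illustration of the communication-avoiding idea, not DASSA's API.

```python
def apply_udf_chunked(data, udf, chunk, halo):
    """Apply a sliding-window UDF over a 1-D signal split into chunks.
    Each chunk is read with `halo` extra samples on each side so the
    UDF needs no data exchange between chunks."""
    out = []
    n = len(data)
    for start in range(0, n, chunk):
        lo = max(0, start - halo)
        hi = min(n, start + chunk + halo)
        res = udf(data[lo:hi])
        # keep only the outputs belonging to [start, start + chunk)
        out.extend(res[start - lo : start - lo + min(chunk, n - start)])
    return out

def moving_sum3(seg):
    """Window-3 sum; out-of-range neighbours count as 0."""
    return [(seg[i - 1] if i > 0 else 0) + seg[i] +
            (seg[i + 1] if i + 1 < len(seg) else 0)
            for i in range(len(seg))]

data = list(range(10))
# halo=1 suffices for a window of 3: chunked output equals the global one
print(apply_udf_chunked(data, moving_sum3, 4, 1) == moving_sum3(data))  # True
```

In a parallel setting each chunk-plus-halo read would go to a different MPI rank; because the halo covers the UDF's stencil, ranks never exchange samples.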
{"title":"Tightening Up the Incentive Ratio for Resource Sharing Over the Rings","authors":"Y. Cheng, Xiaotie Deng, Yuhao Li","doi":"10.1109/IPDPS47924.2020.00023","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00023","url":null,"abstract":"Fundamental issues in resource sharing over large-scale networks have gained much attention from the research community, in response to the growth of the sharing economy over the Internet and mobile networks. We are particularly interested in the fundamental file-sharing and, subsequently, P2P network bandwidth-sharing scheme developed by BitTorrent and later formalized by Wu and Zhang [15] as the proportional response protocol. It is of practical importance that the design provide agents with incentives to follow the distributed protocol out of their own rationality. We study the robustness of the distributed protocol in this incentive issue against Sybil attacks, a common and grave threat in P2P networks. For resource sharing on rings, we characterize the utility gain from a Sybil attack via the concept of the incentive ratio. Previous works proved that the incentive ratio is lower bounded by two and upper bounded by four; the upper bound was later improved to three. Tightening these bounds has been listed as an open problem in [5] and [9]. 
In this paper, we completely resolve this open problem with a better understanding of how agents of different classes influence the resource allocation under the distributed protocol.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"6 1","pages":"127-136"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"78022333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
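For readers unfamiliar with the metric, the incentive ratio can be stated as follows (a standard formulation; the notation is ours, not the paper's):

```latex
\[
  \zeta \;=\; \sup_{i,\ s_i'} \frac{u_i(s_i',\, s_{-i})}{u_i(s_i,\, s_{-i})}
\]
% the largest multiplicative utility gain any agent i can obtain by
% deviating from the honest strategy s_i (e.g., via a Sybil attack),
% with all other agents playing s_{-i}. The bounds discussed in the
% abstract place \zeta between 2 and 3 for resource sharing on rings.
```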
{"title":"Analysis of a List Scheduling Algorithm for Task Graphs on Two Types of Resources","authors":"Lionel Eyraud-Dubois, Suraj Kumar","doi":"10.1109/IPDPS47924.2020.00110","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00110","url":null,"abstract":"We consider the problem of scheduling task graphs on two types of unrelated resources, which arises in the context of task-based runtime systems on modern platforms containing CPUs and GPUs. In this paper, we focus on an algorithm named HeteroPrio, which was originally introduced as an efficient heuristic for a particular application. HeteroPrio is an adaptation of the well-known list scheduling algorithm, in which the tasks are picked by the resources in the order of their acceleration factor. This algorithm is augmented with a spoliation mechanism: a task assigned by the list algorithm can later be reassigned to a different resource if doing so allows the task to finish earlier. We propose here the first theoretical analysis of the HeteroPrio algorithm in the presence of dependencies. More specifically, if the platform contains m and n processors of each type, we show that the worst-case approximation ratio of HeteroPrio is between $1 + \\max\\left(\\frac{m}{n},\\frac{n}{m}\\right)$ and $2 + \\max\\left(\\frac{m}{n},\\frac{n}{m}\\right)$. Our proof structure allows us to precisely identify the necessary conditions on the spoliation strategy to obtain such a guarantee. We also present an in-depth experimental analysis, comparing several such spoliation strategies, and comparing HeteroPrio with other algorithms from the literature. 
Although the worst-case analysis shows the possibility of pathological behavior, HeteroPrio is able to produce, in very reasonable time, schedules of significantly better quality.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"53 1","pages":"1041-1050"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"90885723","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
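HeteroPrio's ordering rule can be sketched for independent tasks: sort by acceleration factor, then let CPUs consume the CPU-friendly end of the list and GPUs the GPU-friendly end. This omits dependencies and the spoliation mechanism, both central to the paper's analysis, so it is a toy illustration of the picking rule only.

```python
import heapq

def heteroprio_makespan(tasks, n_cpu, n_gpu):
    """Schedule independent tasks, each given as (cpu_time, gpu_time),
    on n_cpu CPUs and n_gpu GPUs; returns the makespan."""
    # Acceleration factor cpu_time/gpu_time, ascending:
    # front of the list is CPU-friendly, back is GPU-friendly.
    queue = sorted(tasks, key=lambda t: t[0] / t[1])
    heap = [(0.0, 'cpu', i) for i in range(n_cpu)]
    heap += [(0.0, 'gpu', i) for i in range(n_gpu)]
    heapq.heapify(heap)  # resources ordered by when they become free
    lo, hi = 0, len(queue) - 1
    makespan = 0.0
    while lo <= hi:
        free_at, kind, idx = heapq.heappop(heap)
        if kind == 'cpu':
            free_at += queue[lo][0]   # take from the CPU-friendly end
            lo += 1
        else:
            free_at += queue[hi][1]   # take from the GPU-friendly end
            hi -= 1
        makespan = max(makespan, free_at)
        heapq.heappush(heap, (free_at, kind, idx))
    return makespan

# Each task lands on the resource that runs it fastest.
print(heteroprio_makespan([(10, 1), (1, 10)], 1, 1))
```

On the two-task example, the GPU-friendly task (factor 10) goes to the GPU and the CPU-friendly one (factor 0.1) to the CPU, giving makespan 1 instead of the 10 a mismatched assignment would cost.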
{"title":"Not All Explorations Are Equal: Harnessing Heterogeneous Profiling Cost for Efficient MLaaS Training","authors":"Jun Yi, Chengliang Zhang, Wei Wang, Cheng Li, Feng Yan","doi":"10.1109/IPDPS47924.2020.00051","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00051","url":null,"abstract":"Machine-Learning-as-a-Service (MLaaS) enables practitioners and AI service providers to train and deploy ML models in the cloud using diverse and scalable compute resources. A common problem for MLaaS users is to choose from a variety of training deployment options, notably scale-up (using more capable instances) and scale-out (using more instances), subject to budget limits and/or time constraints. State-of-the-art (SOTA) approaches employ analytical modeling for finding the optimal deployment strategy. However, they have limited applicability as they must be tailored to specific ML model architectures, training frameworks, and hardware. To quickly adapt to the fast-evolving design of ML models and hardware infrastructure, we propose a new Bayesian Optimization (BO) based method, HeterBO, for exploring the optimal deployment of training jobs. Unlike existing BO approaches for general applications, we consider the heterogeneous exploration cost and machine-learning-specific priors to significantly improve the search efficiency. This paper culminates in a fully automated MLaaS training Cloud Deployment system (MLCD) driven by the highly efficient HeterBO search method. 
We have extensively evaluated MLCD in AWS EC2, and the experimental results show that MLCD outperforms two SOTA baselines, conventional BO and CherryPick, by 3.1× and 2.34×, respectively.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"419-428"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89922097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ClusterSR: Cluster-Aware Scattered Repair in Erasure-Coded Storage","authors":"Zhirong Shen, J. Shu, Zhijie Huang, Yingxun Fu","doi":"10.1109/IPDPS47924.2020.00015","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00015","url":null,"abstract":"Erasure coding is a storage-efficient means to guarantee data reliability in today’s commodity storage systems, yet its repair performance is seriously hindered by the substantial repair traffic. Repair in clustered storage systems is even more complicated because of the scarcity of the cross-cluster bandwidth. We present ClusterSR, a cluster-aware scattered repair approach. ClusterSR minimizes the cross-cluster repair traffic by carefully choosing the clusters for reading and repairing chunks. It further balances the cross-cluster repair traffic by scheduling the repair of multiple chunks. Large-scale simulation and Alibaba Cloud ECS experiments show that ClusterSR can reduce cross-cluster repair traffic by 6.7-52.7% and improve repair throughput by 14.1-68.8%.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"24 1","pages":"42-51"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87628935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
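To see why cluster-awareness cuts cross-cluster repair traffic: for a linear erasure code, chunks co-located in one cluster can be combined locally, so only one partial block needs to cross the network per remote cluster. The sketch below illustrates only this counting argument; ClusterSR's actual chunk-selection and multi-chunk load balancing are richer than this.

```python
def cross_cluster_traffic(placement, repair_cluster):
    """placement: cluster id of each of the k surviving chunks needed
    for one repair. Naive repair ships every remote chunk to the
    repair cluster; cluster-aware repair aggregates chunks inside each
    cluster first (valid for linear erasure codes) and ships one
    partial result per remote cluster. Returns (naive, aware) counts
    in units of chunks crossing cluster boundaries."""
    naive = sum(1 for c in placement if c != repair_cluster)
    aware = len({c for c in placement if c != repair_cluster})
    return naive, aware

# Five surviving chunks spread over clusters 0, 1 and 2;
# the repair happens in cluster 2.
print(cross_cluster_traffic([0, 0, 1, 1, 2], repair_cluster=2))  # (4, 2)
```

Choosing *which* clusters to read from (and where to run the repair) then becomes an optimization over these counts, which is the decision ClusterSR makes per repaired chunk.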
{"title":"StragglerHelper: Alleviating Straggling in Computing Clusters via Sharing Memory Access Patterns","authors":"Wenjie Liu, Ping Huang, Xubin He","doi":"10.1109/IPDPS47924.2020.00068","DOIUrl":"https://doi.org/10.1109/IPDPS47924.2020.00068","url":null,"abstract":"Clusters have been a prevalent and successful computing framework for processing large amounts of data due to their distributed and parallelized working paradigm. A task submitted to a cluster is typically divided into a number of subtasks which are assigned to different work nodes running the same code, each dealing with a different, equal-sized portion of the dataset to be processed. Due to heterogeneity, work nodes finish their subtasks at different rates, which can easily result in stragglers unfairly slowing down the entire processing. In this study, we aim to speed up straggling work nodes to quicken the overall processing by leveraging the exhibited performance variation. More specifically, we propose StragglerHelper, which conveys the memory access characteristics experienced by the forerunner to the stragglers so that the stragglers can be sped up by accurately informed memory prefetching. A Progress Monitor is deployed to supervise the respective progress of the work nodes and forward the memory access patterns of the forerunner to straggling nodes. 
Our evaluation results with the SPEC MPI 2007 and BigDataBench on a cluster of 64 work nodes have shown that StragglerHelper is able to improve the execution time of stragglers by up to 99.5% with an average of 61.4%, contributing to an overall improvement of the entire cohort of the cluster by up to 46.7% with an average of 9.9% compared to the baseline cluster.","PeriodicalId":6805,"journal":{"name":"2020 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"32 1","pages":"602-611"},"PeriodicalIF":0.0,"publicationDate":"2020-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"88170851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
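The informed-prefetching effect can be illustrated with a toy cache simulation (ours, not StragglerHelper's actual mechanism): replaying a straggler's accesses with the forerunner's access order as prefetch hints turns most cold misses into hits, because the nodes run the same code over same-sized data portions.

```python
from collections import OrderedDict

def replay(trace, cache_size, hints=None, lookahead=4):
    """Replay a memory-access trace against a FIFO cache and count
    demand misses. If `hints` (the forerunner's access order) is
    given, each access also prefetches the next `lookahead` hinted
    addresses, mimicking informed prefetching on a straggler."""
    cache, misses, pos = OrderedDict(), 0, 0

    def insert(addr):
        cache[addr] = True
        if len(cache) > cache_size:
            cache.popitem(last=False)  # FIFO eviction

    for addr in trace:
        if addr not in cache:
            misses += 1
            insert(addr)
        if hints:
            # align with the forerunner's trace, then prefetch ahead
            while pos < len(hints) and hints[pos] != addr:
                pos += 1
            for h in hints[pos + 1 : pos + 1 + lookahead]:
                if h not in cache:
                    insert(h)
    return misses

t = list(range(8))
print(replay(t, 16), replay(t, 16, hints=t))  # 8 cold misses vs. 1
```

Only the first access misses in the hinted run; every later address has already been prefetched from the forerunner's pattern.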