Breaking Band: A Breakdown of High-performance Communication
Rohit Zambre, M. Grodowitz, Aparna Chandramowlishwaran, Pavel Shamis
Proceedings of the 48th International Conference on Parallel Processing, August 5, 2019. DOI: https://doi.org/10.1145/3337821
Abstract: The critical path of internode communication on large-scale systems is composed of multiple components. When a supercomputing application initiates the transfer of a message using a high-level communication routine such as an MPI_Send, the payload of the message traverses multiple software stacks, the I/O subsystem on both the host and target nodes, and network components such as the switch. In this paper, we analyze where, why, and how much time is spent on the critical path of communication by modeling the overall injection overhead and end-to-end latency of a system. We focus our analysis on the performance of small messages, since fine-grained communication is becoming increasingly important with the growing number of cores per node. The analytical models present an accurate and detailed breakdown of time spent in internode communication. We validate the models on Arm ThunderX2-based servers connected with Mellanox InfiniBand; this is the first work of its kind on Arm. Alongside our breakdown, we describe the methodology to measure the time spent in each component so that readers with access to precise CPU timers and a PCIe analyzer can measure breakdowns on systems of their interest. Such a breakdown is crucial for software developers, system architects, and researchers to guide their optimization efforts. As researchers ourselves, we use the breakdown to simulate the impacts and discuss the likelihoods of a set of optimizations that target the bottlenecks in today's high-performance communication.
EMBA
Yaocheng Xiang, Chencheng Ye, Xiaolin Wang, Yingwei Luo, Zhenlin Wang
Proceedings of the 48th International Conference on Parallel Processing, August 5, 2019. DOI: https://doi.org/10.1145/3337821.3337863
{"title":"JobPacker","authors":"Zhuozhao Li, Haiying Shen","doi":"10.1145/3337821.3337880","DOIUrl":"https://doi.org/10.1145/3337821.3337880","url":null,"abstract":"In spite of many advantages of hybrid electrical/optical datacenter networks (Hybrid-DCN), current job schedulers for data-parallel frameworks are not suitable for Hybrid-DCN, since the schedulers do not aggregate data traffic to facilitate using optical circuit switch (OCS). In this paper, we propose JobPacker, a job scheduler for data-parallel frameworks in Hybrid-DCN that aims to take full advantage of OCS to improve job performance. JobPacker aggregates the data transfers of a job in order to use OCS to improve data transfer efficiency. It first explores the tradeoff between parallelism and traffic aggregation for each shuffle-heavy recurring job, and then generates an offline schedule including which racks to run each job and the sequence to run the recurring jobs in each rack that yields the best performance. It has a new sorting method to prioritize recurring jobs in offline-scheduling to prevent high resource contention while fully utilizing cluster resources. In real-time scheduler, JobPacker uses the offline schedule to guide the data placement and schedule recurring jobs, and schedules non-recurring jobs to the idle resources not assigned to recurring jobs. Trace-driven simulation and GENI-based emulation show that JobPacker reduces the makespan up to 49% and the median completion time up to 43%, compared to the state-of-the-art schedulers in Hybrid-DCN.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"45 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124960498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Parallel Graph Algorithm for Detecting Mesh Singularities in Distributed Memory Ice Sheet Simulations
Ian Bogle, K. Devine, M. Perego, S. Rajamanickam, George M. Slota
Proceedings of the 48th International Conference on Parallel Processing, August 5, 2019. DOI: https://doi.org/10.1145/3337821.3337841
Abstract: We present a new distributed-memory parallel algorithm for detection of degenerate mesh features that can cause singularities in ice sheet mesh simulations. Identifying and removing mesh features such as disconnected components (icebergs) or hinge vertices (peninsulas of ice detached from the land) can significantly improve the convergence of iterative solvers. Because the ice sheet evolves during the course of a simulation, it is important that the detection algorithm can run in situ with the simulation (running in parallel and taking a negligible amount of computation time) so that degenerate features, such as calving icebergs, can be detected as they develop. We present a distributed-memory, BFS-based label-propagation approach to degenerate feature detection that is efficient enough to be called at each step of an ice sheet simulation, while correctly identifying all degenerate features of an ice sheet mesh. Our method finds all degenerate features in a mesh with 13 million vertices in 0.0561 seconds on 1536 cores in the MPAS Albany Land Ice (MALI) model. Compared to the previously used serial pre-processing approach, we observe a 46,000x speedup for our algorithm, and we gain the additional capability of detecting degenerate features dynamically during the simulation.
{"title":"Efficient Data-Parallel Primitives on Heterogeneous Systems","authors":"Zhuohang Lai, Qiong Luo, Xiaolong Xie","doi":"10.1145/3337821.3337920","DOIUrl":"https://doi.org/10.1145/3337821.3337920","url":null,"abstract":"Data-parallel primitives, such as gather, scatter, scan, and split, are widely used in data-intensive applications. However, it is challenging to optimize them on a system consisting of heterogeneous processors. In this paper, we study and compare the existing implementations and optimization strategies for a set of data-parallel primitives on three processors: GPU, CPU and Xeon Phi co-processor. Our goal is to identify the key performance factors in the implementations of data-parallel primitive operations on different architectures and develop general strategies for implementing these primitives efficiently on various platforms. We introduce a portable and efficient sequential memory access pattern, which eliminates the cost of adjusting the memory access pattern for individual device. With proper tuning, our optimized primitive implementations can achieve comparable performance to the native versions. Moreover, our profiling results show that the CPU and the Phi co-processor share most optimization strategies whereas the GPU differs from them significantly, due to the hardware differences among these devices, such as efficiency of vectorization, data and TLB caching, and data prefetching. We summarize these factors and deliver common primitive optimization strategies for heterogeneous systems.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125332264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling the Performance of Atomic Primitives on Modern Architectures","authors":"F. Hoseini, A. Atalar, P. Tsigas","doi":"10.1145/3337821.3337901","DOIUrl":"https://doi.org/10.1145/3337821.3337901","url":null,"abstract":"Utilizing the atomic primitives of a processor to access a memory location atomically is key to the correctness and feasibility of parallel software systems. The performance of atomics plays a significant role in the scalability and overall performance of parallel software systems. In this work, we study the performance -in terms of latency, throughput, fairness, energy consumption- of atomic primitives in the context of the two common software execution settings that result in high and low contention access on shared memory. We perform and present an exhaustive study of the performance of atomics in these two application contexts and propose a performance model that captures their behavior. We consider two state-of-the-art architectures: Intel Xeon E5, Xeon Phi (KNL). We propose a model that is centered around the bouncing of cache lines between threads that execute atomic primitives on these shared cache lines. The model is very simple to be used in practice and captures the behavior of atomics accurately under these execution scenarios and facilitate algorithmic design decisions in multi-threaded programming.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127282688","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved Unconstrained Energy Functional Method for Eigensolvers in Electronic Structure Calculations","authors":"M. D. Ben, O. Marques, A. Canning","doi":"10.1145/3337821.3337914","DOIUrl":"https://doi.org/10.1145/3337821.3337914","url":null,"abstract":"This paper reports on the performance of a preconditioned conjugate gradient based iterative eigensolver using an unconstrained energy functional minimization scheme. In contrast to standard implementations, this scheme avoids an explicit reorthogonalization of the trial eigenvectors and becomes an attractive alternative for the solution of very large problems. The unconstrained formulation is implemented in the first-principles materials and chemistry CP2K code, which performs electronic structure calculations based on a density functional theory approximation to the solution of the many-body Schrödinger equation. We study the convergence of the unconstrained formulation, as well as its parallel scaling, on a Cray XC40 at the National Energy Research Scientific Computing Center (NERSC). The systems we use in our studies are bulk liquid water, a supramolecular catalyst gold(III)-complex, a bilayer of MoS2-WSe2 and a divacancy point defect in silicon, with the number of atoms ranging from 2,247 to 12,288. We show that the unconstrained formulation with an appropriate preconditioner has good convergence properties and scales well to 230k cores, roughly 38% of the full machine.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129540441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive Routing Reconfigurations to Minimize Flow Cost in SDN-Based Data Center Networks
Akbar Majidi, Xiaofeng Gao, S. Zhu, Nazila Jahanbakhsh, Guihai Chen
Proceedings of the 48th International Conference on Parallel Processing, August 5, 2019. DOI: https://doi.org/10.1145/3337821.3337861
Abstract: Data center networks have become heavily reliant on software-defined networking to orchestrate data transmission. To maintain optimal network configurations, a controller needs to solve the multi-commodity flow problem and globally update the network under tight time constraints. In this paper, we aim to minimize flow cost (intuitively, average transmission delay) under reconfiguration budget constraints in data centers. We formulate this optimization problem as a constrained Markov Decision Process and propose a set of algorithms to solve it in a scalable manner. We first develop a propagation algorithm to identify the flows that are most affected in terms of latency and should be reconfigured in the next network update. We then limit the range of flows updated at a time, improving adaptability and scalability by updating fewer flows in each round while still achieving fast operation. Further, based on the drift-plus-penalty method from Lyapunov theory, we propose a heuristic policy that requires no prior information about flow demand and carries a performance guarantee on the additive optimality gap. To the best of our knowledge, this is the first paper to study the range and frequency of flow reconfigurations, which has both theoretical and practical significance in this area. Extensive emulations and numerical simulations, whose results are much better than the estimated theoretical bound, show that our proposed policy outperforms state-of-the-art algorithms in terms of latency by over 45% while also improving adaptability and scalability.
{"title":"Unleashing the Scalability Potential of Power-Constrained Data Center in the Microservice Era","authors":"Xiaofeng Hou, Jiacheng Liu, Chao Li, M. Guo","doi":"10.1145/3337821.3337857","DOIUrl":"https://doi.org/10.1145/3337821.3337857","url":null,"abstract":"Recent scale-out cloud services have undergone a shift from monolithic applications to microservices by putting each functionality into lightweight software containers. Although traditional data center power optimization frameworks excel at per-server or per-rack management, they can hardly make informed decisions when facing microservices that have different QoS requirements on a per-service basis. In a power-constrained data center, blindly budgeting power usage could lead to a power unbalance issue: microservices on the critical path may not receive adequate power budget. This unavoidably hinders the growth of cloud productivity. To unleash the performance potential of cloud in the microservice era, this paper investigates microservice-aware data center resource management. We model microservice using a bipartite graph and propose a metric called microservice criticality factor (MCF) to measure the overall impact of performance scaling on a microservice from the whole application's perspective. We further devise ServiceFridge, a novel system framework that leverages MCF to jointly orchestrate software containers and control hardware power demand. Our detailed case study on a practical microservice application demonstrates that ServiceFridge allows data center to reduce its dynamic power by 25% with slight performance loss. It improves the mean response time by 25.2% and improves the 90th tail latency by 18.0% compared with existing schemes.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122689414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OSP: Overlapping Computation and Communication in Parameter Server for Fast Machine Learning","authors":"Haozhao Wang, Song Guo, Ruixuan Li","doi":"10.1145/3337821.3337828","DOIUrl":"https://doi.org/10.1145/3337821.3337828","url":null,"abstract":"When running in Parameter Server (PS), the Distributed Stochastic Gradient Descent (SGD) incurs significant communication delays because after pushing their updates, computing nodes (workers) have to wait for the global model to be communicated back from the master in every iteration. In this paper, we devise a new synchronization parallel mechanism named overlap synchronization parallel (OSP), in which the waiting time is removed by conducting computation and communication in an overlapped manner. We theoretically prove that our mechanism could achieve the same convergence rate compared to the sequential SGD for non-convex problems. Evaluations show that our mechanism significantly improves performance over the state-of-the-art ones, e.g., by 4× for both AlexNet and ResNet18 in terms of convergence speed.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128512209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}