{"title":"Exploiting Vector Processing in Dynamic Binary Translation","authors":"Chih-Min Lin, Sheng-Yu Fu, Ding-Yong Hong, Yu-Ping Liu, Jan-Jan Wu, W. Hsu","doi":"10.1145/3337821.3337844","DOIUrl":"https://doi.org/10.1145/3337821.3337844","url":null,"abstract":"Auto vectorization techniques have been adopted by compilers to exploit data-level parallelism in parallel processing for decades. However, since processor architectures have kept enhancing with new features to improve vector/SIMD performance, legacy application binaries failed to fully exploit new vector/SIMD capabilities in modern architectures. For example, legacy ARMv7 binaries cannot benefit from ARMv8 SIMD double precision capability, and legacy x86 binaries cannot enjoy the power of AVX-512 extensions. In this paper, we study the fundamental issues involved in cross-ISA Dynamic Binary Translation (DBT) to convert non-vectorized loops to vector/SIMD forms to achieve greater computation throughput available in newer processor architectures. The key idea is to recover critical loop information from those application binaries in order to carry out vectorization at runtime. Experiment results show that our approach achieves an average speedup of 1.42x compared to ARMv7 native run across various benchmarks in an ARMv7-to-ARMv8 dynamic binary translation system.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123972308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Data-Parallel Primitives on Heterogeneous Systems","authors":"Zhuohang Lai, Qiong Luo, Xiaolong Xie","doi":"10.1145/3337821.3337920","DOIUrl":"https://doi.org/10.1145/3337821.3337920","url":null,"abstract":"Data-parallel primitives, such as gather, scatter, scan, and split, are widely used in data-intensive applications. However, it is challenging to optimize them on a system consisting of heterogeneous processors. In this paper, we study and compare the existing implementations and optimization strategies for a set of data-parallel primitives on three processors: GPU, CPU and Xeon Phi co-processor. Our goal is to identify the key performance factors in the implementations of data-parallel primitive operations on different architectures and develop general strategies for implementing these primitives efficiently on various platforms. We introduce a portable and efficient sequential memory access pattern, which eliminates the cost of adjusting the memory access pattern for individual device. With proper tuning, our optimized primitive implementations can achieve comparable performance to the native versions. Moreover, our profiling results show that the CPU and the Phi co-processor share most optimization strategies whereas the GPU differs from them significantly, due to the hardware differences among these devices, such as efficiency of vectorization, data and TLB caching, and data prefetching. We summarize these factors and deliver common primitive optimization strategies for heterogeneous systems.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125332264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"EMBA","authors":"Yaocheng Xiang, Chencheng Ye, Xiaolin Wang, Yingwei Luo, Zhenlin Wang","doi":"10.1145/3337821.3337863","DOIUrl":"https://doi.org/10.1145/3337821.3337863","url":null,"abstract":"EMBA 604 STRATEGIC ANALYSIS. (2) This course provides a framework of competitive analysis and competitive advantage upon which functionally oriented courses in the program may build. It provides an overall picture of the analysis activities and decision-making situations facing a company’s top management team (i.e., CEOs, general managers, division managers) focusing on top management decisions relating to the external environment and internal issues. It presents practical experience in recognizing what information is important, sifting it for relevance, and employing the knowledge for the competitive benefit of the firm. Prereq: Admission to the joint EMBA program.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124916423","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"JobPacker","authors":"Zhuozhao Li, Haiying Shen","doi":"10.1145/3337821.3337880","DOIUrl":"https://doi.org/10.1145/3337821.3337880","url":null,"abstract":"In spite of many advantages of hybrid electrical/optical datacenter networks (Hybrid-DCN), current job schedulers for data-parallel frameworks are not suitable for Hybrid-DCN, since the schedulers do not aggregate data traffic to facilitate using optical circuit switch (OCS). In this paper, we propose JobPacker, a job scheduler for data-parallel frameworks in Hybrid-DCN that aims to take full advantage of OCS to improve job performance. JobPacker aggregates the data transfers of a job in order to use OCS to improve data transfer efficiency. It first explores the tradeoff between parallelism and traffic aggregation for each shuffle-heavy recurring job, and then generates an offline schedule including which racks to run each job and the sequence to run the recurring jobs in each rack that yields the best performance. It has a new sorting method to prioritize recurring jobs in offline-scheduling to prevent high resource contention while fully utilizing cluster resources. In real-time scheduler, JobPacker uses the offline schedule to guide the data placement and schedule recurring jobs, and schedules non-recurring jobs to the idle resources not assigned to recurring jobs. Trace-driven simulation and GENI-based emulation show that JobPacker reduces the makespan up to 49% and the median completion time up to 43%, compared to the state-of-the-art schedulers in Hybrid-DCN.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"45 9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124960498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cosin","authors":"Jingya Zhou, Jianxi Fan, Jin Wang","doi":"10.1145/3337821.3337858","DOIUrl":"https://doi.org/10.1145/3337821.3337858","url":null,"abstract":"Influence Maximization (IM) has been extensively applied to many fields, and the viral marketing in today's online social networks (OSNs) is one of the most famous applications, where a group of seed users are selected to activate more users in a distributed cascading fashion. Many prior work explore the IM problem based on the assumption of given budget. However, the budget assumption does not hold in many practical scenarios, since companies might have no sufficient prior knowledge about the market. Moreover, companies prefer a moderately controllable viral marketing that allows them to adjust marketing decision according to the market reaction. In this paper, we propose a new problem, called Controllable social influence maximization (Cosin), to find a set of seed users inside a controllable scope to maximize the benefit given an expected return on investment (ROI). Like the IM problem, the Cosin problem is also NP-hard. We present a distributed multi-hop based framework for the influence estimation, and design a (1/2 + ϵ)-approximate algorithm based on the proposed framework. Moreover, we further present a distributed implementation to accelerate the execution of algorithm for large-scale social networks. Extensive experiments with a billion-scale social network indicate that the proposed algorithms outperform state-of-the-art algorithms in both benefit and running time.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130358346","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improved Unconstrained Energy Functional Method for Eigensolvers in Electronic Structure Calculations","authors":"M. D. Ben, O. Marques, A. Canning","doi":"10.1145/3337821.3337914","DOIUrl":"https://doi.org/10.1145/3337821.3337914","url":null,"abstract":"This paper reports on the performance of a preconditioned conjugate gradient based iterative eigensolver using an unconstrained energy functional minimization scheme. In contrast to standard implementations, this scheme avoids an explicit reorthogonalization of the trial eigenvectors and becomes an attractive alternative for the solution of very large problems. The unconstrained formulation is implemented in the first-principles materials and chemistry CP2K code, which performs electronic structure calculations based on a density functional theory approximation to the solution of the many-body Schrödinger equation. We study the convergence of the unconstrained formulation, as well as its parallel scaling, on a Cray XC40 at the National Energy Research Scientific Computing Center (NERSC). The systems we use in our studies are bulk liquid water, a supramolecular catalyst gold(III)-complex, a bilayer of MoS2-WSe2 and a divacancy point defect in silicon, with the number of atoms ranging from 2,247 to 12,288. We show that the unconstrained formulation with an appropriate preconditioner has good convergence properties and scales well to 230k cores, roughly 38% of the full machine.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129540441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"OSP: Overlapping Computation and Communication in Parameter Server for Fast Machine Learning","authors":"Haozhao Wang, Song Guo, Ruixuan Li","doi":"10.1145/3337821.3337828","DOIUrl":"https://doi.org/10.1145/3337821.3337828","url":null,"abstract":"When running in Parameter Server (PS), the Distributed Stochastic Gradient Descent (SGD) incurs significant communication delays because after pushing their updates, computing nodes (workers) have to wait for the global model to be communicated back from the master in every iteration. In this paper, we devise a new synchronization parallel mechanism named overlap synchronization parallel (OSP), in which the waiting time is removed by conducting computation and communication in an overlapped manner. We theoretically prove that our mechanism could achieve the same convergence rate compared to the sequential SGD for non-convex problems. Evaluations show that our mechanism significantly improves performance over the state-of-the-art ones, e.g., by 4× for both AlexNet and ResNet18 in terms of convergence speed.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128512209","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AdaM","authors":"Shiyi Cao, Yuanning Gao, Xiaofeng Gao, Guihai Chen","doi":"10.1145/3337821.3337822","DOIUrl":"https://doi.org/10.1145/3337821.3337822","url":null,"abstract":"Distributed metadata management, administrating the distribution of metadata nodes on different metadata servers (MDS's), can substantially improve overall performance of large-scale distributed storage systems if well designed. A major difficulty confronting many metadata management schemes is the trade-off between two conflicting aspects: system load balance and metadata locality preservation. It becomes even more challenging as file access pattern inevitably varies with time. However, existing works dynamically reallocate nodes to different servers adopting history-based coarse-grained methods, failing to make timely and efficient update on distribution of nodes. In this paper, we propose an adaptive fine-grained metadata management scheme, AdaM, leveraging Deep Reinforcement Learning, to address the trade-off dilemma against time-varying access pattern. At each time step, AdaM collects environmental \"states\" including access pattern, the structure of namespace tree and current distribution of nodes on MDS's. Then an actor-critic network is trained to reallocate hot metadata nodes to different servers according to the observed \"states\". Adaptive to varying access pattern, AdaM can automatically migrate hot metadata nodes among servers to keep load balancing while maintaining metadata locality. We test AdaM on real-world data traces. Experimental results demonstrate the superiority of our proposed method over other schemes.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128190948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Machine Learning for Fine-Grained Hardware Prefetcher Control","authors":"Jason Hiebel, Laura E. Brown, Zhenlin Wang","doi":"10.1145/3337821.3337854","DOIUrl":"https://doi.org/10.1145/3337821.3337854","url":null,"abstract":"Modern architectures provide hardware memory prefetching capabilities which can be configured at runtime. While hardware prefetching can provide substantial performance improvements for many programs, prefetching can also increase contention for shared resources such as last-level cache and memory bandwidth. In turn, this contention can degrade performance in multi-core workloads. In this paper, we model fine-grained hardware prefetcher control as a contextual bandit, and propose a framework for learning prefetcher control policies which adjust hardware prefetching usage at runtime according to workload performance behavior. We train our policies on profiling data, wherein hardware memory prefetchers are enabled or disabled randomly at regular intervals over the course of a workload's execution. The learned prefetcher control policies provide up to a 4.3% average performance improvement over a set of memory bandwidth intensive workloads.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128817438","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Parallel Graph Algorithm for Detecting Mesh Singularities in Distributed Memory Ice Sheet Simulations","authors":"Ian Bogle, K. Devine, M. Perego, S. Rajamanickam, George M. Slota","doi":"10.1145/3337821.3337841","DOIUrl":"https://doi.org/10.1145/3337821.3337841","url":null,"abstract":"We present a new, distributed-memory parallel algorithm for detection of degenerate mesh features that can cause singularities in ice sheet mesh simulations. Identifying and removing mesh features such as disconnected components (icebergs) or hinge vertices (peninsulas of ice detached from the land) can significantly improve the convergence of iterative solvers. Because the ice sheet evolves during the course of a simulation, it is important that the detection algorithm can run in situ with the simulation --- running in parallel and taking a negligible amount of computation time --- so that degenerate features (e.g., calving icebergs) can be detected as they develop. We present a distributed memory, BFS-based label-propagation approach to degenerate feature detection that is efficient enough to be called at each step of an ice sheet simulation, while correctly identifying all degenerate features of an ice sheet mesh. Our method finds all degenerate features in a mesh with 13 million vertices in 0.0561 seconds on 1536 cores in the MPAS Albany Land Ice (MALI) model. Compared to the previously used serial pre-processing approach, we observe a 46,000x speedup for our algorithm, and provide additional capability to do dynamic detection of degenerate features in the simulation.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"115 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123983392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}