{"title":"PARVMEC: An Efficient, Scalable Implementation of the Variational Moments Equilibrium Code","authors":"S. Seal, S. Hirshman, A. Wingen, R. Wilcox, M. Cianciosa, E. Unterberg","doi":"10.1109/ICPP.2016.77","DOIUrl":"https://doi.org/10.1109/ICPP.2016.77","url":null,"abstract":"The ability to sustain magnetically confined plasma in a state of stable equilibrium is crucial for optimal and cost-effective operations of fusion devices like tokamaks and stellarators. The Variational Moments Equilibrium Code (VMEC) is the de-facto serial application used by fusion scientists to compute magnetohydrodynamics (MHD) equilibria and study the physics of three dimensional plasmas in confined configurations. Modern fusion energy experiments have larger system scales with more interactive experimental workflows, both demanding faster analysis turnaround times on computational workloads that are stressing the capabilities of sequential VMEC. In this paper, we present PARVMEC, an efficient, parallel version of its sequential counterpart, capable of scaling to thousands of processors on distributed memory machines. PARVMEC is a non-linear code, with multiple numerical physics modules, each with its own computational complexity. A detailed speedup analysis supported by scaling results on 1,024 cores of a Cray XC30 supercomputer is presented. Depending on the mode of PARVMEC execution, speedup improvements of one to two orders of magnitude are reported. PARVMEC equips fusion scientists for the first time with a state-of-the-art capability for rapid, high fidelity analyses of magnetically confined plasmas at unprecedented scales.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115701405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"HppCnn: A High-Performance, Portable Deep-Learning Library for GPGPUs","authors":"Yi Yang, Min Feng, S. Chakradhar","doi":"10.1109/ICPP.2016.73","DOIUrl":"https://doi.org/10.1109/ICPP.2016.73","url":null,"abstract":"The massively parallel computation capability has made GPGPUs a promising platform for convolutional neural networks (CNNs). In this paper, we present HppCnn, a CNN library achieves both the high performance and portability on GPGPUs. In HppCnn, we propose a novel three-step approach to implement convolutional kernels using Nvidia cuBLAS efficiently. To overcome limitations of our three-step approach, we improve cuBLAS by enabling nested parallelism, and implement a low-cost auto-tuning module to leveraging existing libraries in the runtime. The experiments show HppCnn achieves significant speedups over both other cuBLAS-based and hand-optimized solutions. The results also show our solution delivers near-optimal performance on GPUs with the portability.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122449491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MIC: An Efficient Anonymous Communication System in Data Center Networks","authors":"Tingwei Zhu, D. Feng, Yu Hua, F. Wang, Qingyu Shi, Jiahao Liu","doi":"10.1109/ICPP.2016.9","DOIUrl":"https://doi.org/10.1109/ICPP.2016.9","url":null,"abstract":"With the rapid growth of application migration, the anonymity in data center networks becomes important in breaking attack chains and guaranteeing user privacy. However, existing anonymity systems are designed for the Internet environment, which suffer from high computational and network resource consumption and deliver low performance, thus failing to be directly deployed in data centers. In order to address this problem, this paper proposes an efficient and easily deployed anonymity scheme for SDN-based data centers, called MIC. The main idea behind MIC is to conceal the communication participants by modifying the source/destination addresses (such as MAC, IP and port) at switch nodes, so as to achieve anonymity. Compared with the traditional overlay-based approaches, our in-network scheme has shorter transmission paths and less intermediate operations, thus achieving higher performance with less overhead. We also propose a collision avoidance mechanism to ensure the correctness of routing, and two mechanisms to enhance the traffic-analysis resistance. Our security analysis demonstrates that MIC ensures unlinkability and improves traffic-analysis resistance. Our experiments show that MIC has extremely low overhead compared with the base-line TCP (or SSL), e.g., less than 1% overhead in terms of throughput.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116232459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Thread Similarity Matrix: Visualizing Branch Divergence in GPGPU Programs","authors":"Zhibin Yu, L. Eeckhout, Chengzhong Xu","doi":"10.1109/ICPP.2016.27","DOIUrl":"https://doi.org/10.1109/ICPP.2016.27","url":null,"abstract":"Graphics processing units (GPUs) have recently evolved into popular accelerators for general-purpose parallel programs -- so-called GPGPU computing. Although programming models such as CUDA and OpenCL significantly improve GPGPU programmability, optimizing GPGPU programs is still far from trivial. Branch divergence is one of the root causes reducing GPGPU performance. Existing approaches are able to calculate the branch divergence rate but are unable to reveal how the branches diverge in a GPGPU program. In this paper, we propose the Thread Similarity Matrix (TSM) to visualize how branches diverge and in turn help find optimization opportunities. TSM contains an element for each pair of threads, representing the difference in code being executed by the pair of threads. The darker the element, the more similar the threads are, the lighter, the more dissimilar. TSM therefore allows GPGPU programmers to easily understand an application's branch divergence behavior and pinpoint performance anomalies. We present a case study to demonstrate how TSM can help optimize GPGPU programs: we improve the performance of a highly-optimized GPGPU kernel by 35% by reorganizing its thread organization to reduce its branch divergence rate.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126269474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Understanding the Architectural Characteristics of EDA Algorithms","authors":"Xin Wang, Xiaofeng Ji, Yunping Lu, Yi Li, Weijia Zhou, Weihua Zhang, Wenyun Zhao","doi":"10.1109/ICPP.2016.23","DOIUrl":"https://doi.org/10.1109/ICPP.2016.23","url":null,"abstract":"Currently, the release of different chip products has come to a burst. Time-to-market period of these products has been shortened to an extreme, nearly 8 to 12 months. To reduce production period, hardware architects try to shorten every design and manufacture stage. Therefore, it has become one of the major concerns for them that how to accelerate electronic design automation (EDA) tools, which have been widely used throughout the lifetime of chip design and manufacture. While many prior efforts have done in-depth works on different acceleration techniques, such as IC-based, FPGA-based, or GPUbased, to our best knowledge, there has been no systematic study towards the architectural characteristics analysis for these EDA algorithms. This may impede the further optimizations and acceleration for them. In this paper, we make the first attempt to construct an EDA benchmark suite (EDAbench for short) for architectural design, parallel acceleration, and system optimization. EDAbench covers representative modern EDA algorithms. We then evaluate predominant architectural characteristics from three aspects including computation characteristics, memory hierarchy, and systematic characteristics. Experimental results reveal that there are some vital gaps between existing hardware and the requirements of EDA algorithms. Based on the analysis, we also give out some insights and propose suggestions for future optimization, acceleration, and architecture design.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124690392","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Future(s) of Transactional Memory","authors":"Jingna Zeng, J. Barreto, Seif Haridi, L. Rodrigues, P. Romano","doi":"10.1109/ICPP.2016.57","DOIUrl":"https://doi.org/10.1109/ICPP.2016.57","url":null,"abstract":"This work investigates how to combine two powerful abstractions to manage concurrent programming: Transactional Memory (TM) and futures. The former hides from programmers the complexity of synchronizing concurrent access to shared data, via the familiar abstraction of atomic transactions. The latter serves to schedule and synchronize the parallel execution of computations whose results are not immediately required. While TM and futures are two widely investigated topics, the problem of how to exploit these two abstractions in synergy is still largely unexplored in the literature. This paper fills this gap by introducing Java Transactional Futures (JTF), a Java-based TM implementation that allows programmers to use futures to coordinate the execution of parallel tasks, while leveraging transactions to synchronize accesses to shared data. JTF provides a simple and intuitive semantic regarding the admissible serialization orders of the futures spawned by transactions, by ensuring that the results produced by a future are always consistent with those that one would obtain by executing the future sequentially. Our experimental results show that the use of futures in a TM allows not only to unlock parallelism within transactions, but also to reduce the cost of conflicts among top-level transactions in high contention workloads.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126003260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Boosting Opportunities under Communication Imbalance in Power-Constrained HPC Clusters","authors":"Leonardo Piga, Indrani Paul, Wei Huang","doi":"10.1109/ICPP.2016.11","DOIUrl":"https://doi.org/10.1109/ICPP.2016.11","url":null,"abstract":"This paper provides a detailed message-passing interface (MPI) communication characterization across representative HPC applications. It further evaluates performance and power efficiency improvement opportunities. Specifically, it shows that the traditional approach of active polling while waiting for MPI messages is extremely power inefficient, especially under a constrained cluster-level power budget, where processors can only operate at some percentage of their labeled thermal design power (TDP) due to data center infrastructure limits. To mitigate the communication imbalance among different nodes, one can choose to power gate waiting processes and shift remaining power budget to processes that are in the critical execution paths, a technique we call Gate&Shift. With considerations of overheads from power gating and control-loop, Gate&Shift leads to performance improvement without additional power overhead. Gate&Shift is a reactive scheme that does not require prediction mechanisms. With the aid of real MPI traces and hardware measured power data from an HPC cluster, we show that (1) 1 ms control period for power-shifting is sufficient to achieve most potential performance gains, and (2) for a cluster with processors running at 65% of their labeled TDP, Gate&Shift can achieve 7%, 8.5% and 9% performance improvement for AMR Boxlib, Fill Boundary and Big FFT, respectively.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130996821","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proxy-Guided Load Balancing of Graph Processing Workloads on Heterogeneous Clusters","authors":"Shuang Song, Meng Li, Xinnian Zheng, Michael LeBeane, Jee Ho Ryoo, Reena Panda, A. Gerstlauer, L. John","doi":"10.1109/ICPP.2016.16","DOIUrl":"https://doi.org/10.1109/ICPP.2016.16","url":null,"abstract":"Big data decision-making techniques take advantage of large-scale data to extract important insights from them. One of the most important classes of such techniques falls in the domain of graph applications, where data segments and their inherent relationships are represented as vertices and edges. Efficiently processing large-scale graphs involves many subtle tradeoffs and is still regarded as an open-ended problem. Furthermore, as modern data centers move towards increased heterogeneity, the traditional assumption of homogeneous environments in current graph processing frameworks is no longer valid. Prior work estimates the graph processing power of heterogeneous machines by simply reading hardware configurations, which leads to suboptimal load balancing. In this paper, we propose a profiling methodology leveraging synthetic graphs for capturing a node's computational capability and guiding graph partitioning in heterogeneous environments with minimal overheads. We show that by sampling the execution of applications on synthetic graphs following a power-law distribution, the computing capabilities of heterogeneous clusters can be captured accurately (<;10% error). Our proxy-guided graph processing system results in a maximum speedup of 1.84x and 1.45x over a default system and prior work, respectively. On average, it achieves 17.9% performance improvement and 14.6% energy reduction as compared to prior heterogeneity-aware work.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131326754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"MobiSensing: Exploiting Human Mobility for Multi-application Mobile Data Sensing with Low User Intervention","authors":"Kang-Peng Chen, Haiying Shen","doi":"10.1109/ICPP.2016.63","DOIUrl":"https://doi.org/10.1109/ICPP.2016.63","url":null,"abstract":"The explosive growth of personal mobile devices (e.g., smartphones and pads) has brought about significant potential distributed sensing resources. However, such resources have not been fully utilized due to two problems: i) mobile device mobility usually is not dedicated to data sensing, and ii) users may not be willing to participate in the data sensing proactively, i.e., move to or wait in a specific area. To address these problems, we propose a sensing system, namely MobiSensing, with a low intervention to device owners. It uses the semi-Markov process to model node mobility for future mobility prediction. While moving around, mobile devices connect to the central task assignment server opportunistically through their owners' daily usage. In each connection, the server predicts the connected device's next connection and its mobility between current and the next connection. Then, the server assigns sensing tasks in this period of time that the node is likely to complete to the node. As a result, no proactive operations or movements are required for device owners, and sensing tasks can be completed passively and efficiently. Trace-driven experiments demonstrate the high successful rate of MobiSensing.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116262428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DC-Top-k: A Novel Top-k Selecting Algorithm and Its Parallelization","authors":"Z. Xue, Ruixuan Li, Heng Zhang, X. Gu, Zhiyong Xu","doi":"10.1109/ICPP.2016.49","DOIUrl":"https://doi.org/10.1109/ICPP.2016.49","url":null,"abstract":"Sorting is a basic computational task in Computer Science. As a variant of the sorting problem, top-k selecting have been widely used. To our knowledge, on average, the state-of-the-art top-k selecting algorithm Partial Quicksort takes C(n, k) = 2(n+1)Hn+2n-6k+6-2(n+3-k)Hn+1-k comparisons and about C(n, k)/6 exchanges to select the largest k terms from n terms, where Hn denotes the n-th harmonic number. In this paper, a novel top-k algorithm called DC-Top-k is proposed by employing a divide-and-conquer strategy. By a theoretical analysis, the algorithm is proved to be competitive with the state-of-the-art top-k algorithm on the compare time, with a significant improvement on the exchange time. On average, DC-Top-k takes at most (2-1/k)n+O(klog2k) comparisons and O(klog2k) exchanges to select the largest k terms from n terms. The effectiveness of the proposed algorithm is verified by a number of experiments which show that DC-Top-k is 1-3 times faster than Partial Quicksort and, moreover, is notably stabler than the latter. With an increase of k, it is also significantly more efficient than Min-heap based top-k algorithm (U. S. Patent, 2012). In the end, DC-Top-k is naturally implemented in a parallel computing environment, and a better scalability than Partial Quicksort is also demonstrated by experiments.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124850648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}