GLTO: On the Adequacy of Lightweight Thread Approaches for OpenMP Implementations
Adrián Castelló, Sangmin Seo, R. Mayo, P. Balaji, E. S. Quintana-Ortí, Antonio J. Peña
In: 2017 46th International Conference on Parallel Processing (ICPP). DOI: https://doi.org/10.1109/ICPP.2017.15
Abstract: OpenMP is the de facto standard application programming interface (API) for on-node parallelism. The most popular OpenMP runtimes rely on POSIX threads (pthreads) implementations, which offer excellent performance for coarse-grained parallelism and map well onto current hardware. However, a recent trend in runtimes and applications points toward leveraging massive on-node parallelism in conjunction with fine-grained and dynamic scheduling paradigms. Lightweight thread (LWT) solutions have been shown to be more appropriate for these new parallel paradigms. We have developed GLTO, an OpenMP implementation over the recently emerged Generic Lightweight Threads (GLT) API. GLT exports a common API for LWT libraries, making it possible to run the same application over different native LWT solutions. In this paper we use GLTO to analyze scenarios in which OpenMP implementations may benefit from the use of either LWT or pthreads. Our study reveals that no single threading approach obtains the best performance in all scenarios, and that there are important gaps among them.

Accelerating Graph Analytics by Utilising the Memory Locality of Graph Partitioning
Jiawen Sun, H. Vandierendonck, Dimitrios S. Nikolopoulos
In: 2017 46th International Conference on Parallel Processing (ICPP). DOI: https://doi.org/10.1109/ICPP.2017.27
Abstract: This paper investigates how to improve the memory locality of graph-structured analytics on large-scale shared-memory systems. We demonstrate that a graph partitioning in which all in-edges for a vertex are placed in the same partition improves memory locality. However, realising performance improvements through such a partitioning poses several challenges and requires rethinking the classification of graph algorithms and the preferred data structures. We introduce the notion of medium-dense frontiers, a type of frontier that is sufficiently dense for a bitmap representation, yet benefits from an indexed graph layout. Using three types of frontiers, and three graph layout schemes optimized for each frontier type, we design an edge traversal algorithm that autonomously decides which type to use. The distinction between forward and backward graph traversal folds into this decision and need no longer be specified by the programmer. We have implemented our techniques in a NUMA-aware graph analytics framework derived from Ligra and demonstrate a speedup of up to 4.34× over Ligra and up to 2.93× over Polymer.
{"title":"A Coflow-Based Co-Optimization Framework for High-Performance Data Analytics","authors":"Long Cheng, Ying Wang, Yulong Pei, D. Epema","doi":"10.1109/ICPP.2017.48","DOIUrl":"https://doi.org/10.1109/ICPP.2017.48","url":null,"abstract":"Efficient execution of distributed database operators such as joining and aggregating is critical for the performance of big data analytics. With the increase of the compute speedup of modern CPUs, reducing the network communication time of these operators in large systems is becoming increasingly important, and also challenging current techniques. Significant performance improvements have been achieved by using state-of-the-art methods, such as reducing network traffic designed in the data management domain, and data flow scheduling in the data communications domain. However, the proposed techniques in both fields just view each other as a black box, and performance gains from a co-optimization perspective have not yet been explored.In this paper, based on current research in coflow scheduling, we propose a novel Coflow-based Co-optimization Framework (CCF), which can co-optimize application-level data movement and network-level data communications for distributed operators, and consequently contribute to their performance in large distributed environments. We present the detailed design and implementation of CCF, and conduct an experimental evaluation of CCF using large-scale simulations on large data joins. Our results demonstrate that CCF can always perform faster than current approaches on network communications in large-scale distributed scenarios.","PeriodicalId":392710,"journal":{"name":"2017 46th International Conference on Parallel Processing (ICPP)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121024696","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Constrained Tensor Factorization with Accelerated AO-ADMM","authors":"Shaden Smith, Alec Beri, G. Karypis","doi":"10.1109/ICPP.2017.20","DOIUrl":"https://doi.org/10.1109/ICPP.2017.20","url":null,"abstract":"Low-rank sparse tensor factorization is a populartool for analyzing multi-way data and is used in domainssuch as recommender systems, precision healthcare, and cybersecurity.Imposing constraints on a factorization, such asnon-negativity or sparsity, is a natural way of encoding priorknowledge of the multi-way data. While constrained factorizationsare useful for practitioners, they can greatly increasefactorization time due to slower convergence and computationaloverheads. Recently, a hybrid of alternating optimization andalternating direction method of multipliers (AO-ADMM) wasshown to have both a high convergence rate and the ability tonaturally incorporate a variety of popular constraints. In thiswork, we present a parallelization strategy and two approachesfor accelerating AO-ADMM. By redefining the convergencecriteria of the inner ADMM iterations, we are able to splitthe data in a way that not only accelerates the per-iterationconvergence, but also speeds up the execution of the ADMMiterations due to efficient use of cache resources. Secondly,we develop a method of exploiting dynamic sparsity in thefactors to speed up tensor-matrix kernels. These combinedadvancements achieve up to 8 speedup over the state-of-the art on a variety of real-world sparse tensors.","PeriodicalId":392710,"journal":{"name":"2017 46th International Conference on Parallel Processing (ICPP)","volume":"81 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126887651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Order/Radix Problem: Towards Low End-to-End Latency Interconnection Networks
Ryota Yasudo, M. Koibuchi, K. Nakano, Hiroki Matsutani, H. Amano
In: 2017 46th International Conference on Parallel Processing (ICPP). DOI: https://doi.org/10.1109/ICPP.2017.41
Abstract: We introduce a novel graph called a host-switch graph, which consists of host vertices and switch vertices. Using host-switch graphs, we formulate a graph problem called the order/radix problem (ORP) for designing low end-to-end latency interconnection networks. Our focus is on reducing the host-to-host average shortest path length (h-ASPL), since the shortest path length between hosts in a host-switch graph corresponds to the end-to-end latency of the network. We hence define ORP as follows: given the order (the number of hosts) and the radix (the number of ports per switch), find a host-switch graph with the minimum h-ASPL. We demonstrate that the optimal number of switches can be predicted mathematically. On the basis of this prediction, we apply a randomized algorithm to find a host-switch graph with the minimum h-ASPL. Interestingly, our solutions include host-switch graphs in which switches accommodate different numbers of hosts. We then apply host-switch graphs to interconnection networks and evaluate them in practice. Compared with three conventional interconnection networks (the torus, the dragonfly, and the fat-tree), our networks provide higher performance while requiring fewer switches.
{"title":"High Performance Query Processing for Web Scale RDF Data using BSP Style Communication and Balanced Distribution","authors":"Minho Bae, Junho Eum, Donghoon Kim, Sangyoon Oh","doi":"10.1109/ICPP.2017.29","DOIUrl":"https://doi.org/10.1109/ICPP.2017.29","url":null,"abstract":"To overcome scalability and performance issues for process queries over a web-scale RDF data, various studies have proposed RDF SPARQL query processing algorithm using parallel processing manners. However, it is hard to resolve the scalability and performance issues together because the problem of communication overhead between nodes is closely related to the data distribution for parallel processing. For efficient RDF query parallel processing, it is essential to distribute and process data evenly while reducing communication overhead. In this paper, we propose RDF query parallel processing algorithms with RDF data partitioning technique to guarantee evenly distributed data over the cluster. We also propose our in-memory RDF query processing system as a form of Bulk Synchronization Parallel system to reduce network overhead. Our empirical evaluation results show that the proposed system outperforms a popular RDF-3X on LUBM benchmark and UniProt queries from 2.20 to 43.08 times. Especially, the effectiveness of the system improves significantly when the SPARQL queries are complex with high input and select.","PeriodicalId":392710,"journal":{"name":"2017 46th International Conference on Parallel Processing (ICPP)","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129076220","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Data Sharing on Heterogeneous Systems","authors":"Victor Garcia-Flores, E. Ayguadé, Antonio J. Peña","doi":"10.1109/ICPP.2017.21","DOIUrl":"https://doi.org/10.1109/ICPP.2017.21","url":null,"abstract":"General-purpose computing on GPUs has become more accessible due to features such as shared virtual memory and demand paging. Unfortunately it comes at a price, and that is performance. Automatic memory management is convenient but suffers from many drawbacks, preventing heterogeneous systems from achieving their full potential. In this work we analyze the challenges and inefficiencies of demand paging in GPUs, in particular on collaborative computations where data migrates multiple times between host and device. We establish that demand paging on GPUs introduces significant overheads for these kind of computations, and identify the issues of false sharing and unnecessary data transfers derived from the granularity at which data is migrated. In order to alleviate these problems we propose a memory organization and dynamic migration scheme to efficiently share data between host and device at fine granularities and without software intervention. We evaluate our design with a set of collaborative heterogeneous benchmarks and find it achieves 15% lower execution times on average with cache line-sized migrations, but severely degrading performance on benchmarks that access large blocks of contiguous memory. Page-sized migrations, although inefficient, provide on average a 47% execution time reduction with our design over a baseline system implementing demand paging. Our results suggest that cache line-sized migrations are not feasible in systems using a PCI-Express interconnect. In order to understand how future interconnect technologies will impact the feasibility of fine-grained migrations, we evaluate our scheme with various link latencies. We find interconnect latencies four to five times lower than PCI-Express are sufficient to effectively share data at finer granularities.","PeriodicalId":392710,"journal":{"name":"2017 46th International Conference on Parallel Processing (ICPP)","volume":"180 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131703195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Highly Efficient DGEMM on the Emerging SW26010 Many-Core Processor","authors":"Lijuan Jiang, Chao Yang, Yulong Ao, Wanwang Yin, Wenjing Ma, Qiao Sun, Fangfang Liu, Rongfen Lin, P. Zhang","doi":"10.1109/ICPP.2017.51","DOIUrl":"https://doi.org/10.1109/ICPP.2017.51","url":null,"abstract":"The matrix-matrix multiplication is an essential building block that can be found in various scientific and engineering applications. High-performance implementations of the matrix-matrix multiplication on state-of-the-art processors may be of great importance for both the vendors and the users. In this paper, we present a detailed methodology of implementing and optimizing the double-precision general format matrix-matrix multiplication (DGEMM) kernel on the emerging SW26010 processor, which is used to build the Sunway TaihuLight supercomputer. We propose a three level blocking algorithm to orchestrate data on the memory hierarchy and expose parallelism on different hardware levels, and design a collective data sharing scheme by using the register communication mechanism to exchange data efficiently among different cores. On top of those, further optimizations are done based on a data-thread mapping method for efficient data distribution, a double buffering scheme for asynchronous DMA data transfer, and an instruction scheduling method for maximizing the pipeline usage. Experiment results show that the proposed DGEMM implementation can fully exploit the unique hardware features provided by SW26010 and can sustain up to 95% of the peak performance.","PeriodicalId":392710,"journal":{"name":"2017 46th International Conference on Parallel Processing (ICPP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130833548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Locality-Aware Dynamic Task Graph Scheduling
Jordyn C. Maglalang, S. Krishnamoorthy, Kunal Agrawal
In: 2017 46th International Conference on Parallel Processing (ICPP). DOI: https://doi.org/10.1109/ICPP.2017.16
Abstract: Dynamic task graph schedulers automatically balance work across processor cores by scheduling tasks among available threads while preserving dependences. In this paper, we design NABBITC, a provably efficient dynamic task graph scheduler that accounts for data locality on NUMA systems. NABBITC allows users to assign a color to each task representing the location (e.g., a processor core) that has the most efficient access to the data needed during that node's execution. NABBITC then automatically adjusts the scheduling so as to preferentially execute each node at the location that matches its color, leading to better locality because the node is likely to make local rather than remote accesses. At the same time, NABBITC tries to optimize load balance and not add too much overhead compared to the vanilla NABBIT scheduler, which does not consider locality. We provide a theoretical analysis showing that NABBITC does not asymptotically impact the scalability of NABBIT. We evaluated the performance of NABBITC on a suite of benchmarks including both memory- and compute-intensive applications. Our experiments indicate that adding locality awareness yields a considerable performance advantage over the vanilla NABBIT scheduler. Furthermore, we compared NABBITC to both OpenMP tasks and OpenMP loops. For regular applications, OpenMP loops can achieve perfect locality and perfect load balance statically; for these benchmarks, NABBITC has a small performance penalty compared to OpenMP due to its dynamic scheduling strategy. Similarly, for compute-intensive applications with coarse-grained tasks, OpenMP tasks' centralized scheduler provides the best performance. However, we find that NABBITC provides a good trade-off between data locality and load balance: on memory-intensive jobs it consistently outperforms OpenMP tasks, while on irregular jobs where load balancing is important it outperforms OpenMP loops. Therefore, NABBITC combines the benefits of locality-aware scheduling for regular, memory-intensive applications (the forte of static schedulers such as those in OpenMP) with dynamic adaptation to load imbalance in irregular applications (the forte of dynamic schedulers such as Cilk Plus, TBB, and NABBIT).
{"title":"CELIA: Cost-Time Performance of Elastic Applications on Cloud","authors":"Sunimal Rathnayake, Dumitrel Loghin, Y. M. Teo","doi":"10.1109/ICPP.2017.43","DOIUrl":"https://doi.org/10.1109/ICPP.2017.43","url":null,"abstract":"Clouds offer great flexibility for scaling applications due to the wide spectrum of resources with different cost-performance, inherent resource elasticity and pay-peruse charging. However, determining cost-time-efficient cloud configurations to execute a given application in the large resource configuration space remains a key challenge. The growing importance of elastic applications for which the accuracy is a function of resource consumption introduces new opportunities to exploit resource elasticity on clouds. In this paper, we introduce CELIA, a measurement-driven analytical modeling approach to determine cost-time-optimal cloud resource configurations to execute a given elastic application with a time deadline and a cost budget. We evaluate CELIA with three representative elastic applications on more than ten million configurations consisting of Amazon EC2 resource types with different cost-performance. Using CELIA, we show that multiple cost-time Pareto-optimal configurations exist among feasible cloud configurations that execute an elastic application within a time deadline and cost budget. These Pareto-optimal configurations exhibit up to 30% cost savings for an elastic application representing n-body simulation. We investigate the impact of fixed-time scaling on the cost of executing elastic applications on cloud. We show that cost gradient with respect to resource demand is smaller when cloud resources with better cost-performance are used. Furthermore, we show that the relative increase in cost is always smaller compared to the relative reduction of execution time deadline. For example, tightening the execution time deadline by two-thirds incurs only 40% increase in cost for the n-body simulation application.","PeriodicalId":392710,"journal":{"name":"2017 46th International Conference on Parallel Processing (ICPP)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125931540","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}