J. Daily, Abhinav Vishnu, B. Palmer, H. V. Dam, D. Kerbyson
{"title":"On the suitability of MPI as a PGAS runtime","authors":"J. Daily, Abhinav Vishnu, B. Palmer, H. V. Dam, D. Kerbyson","doi":"10.1109/HiPC.2014.7116712","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116712","url":null,"abstract":"Partitioned Global Address Space (PGAS) models are emerging as a popular alternative to MPI models for designing scalable applications. At the same time, MPI remains a ubiquitous communication subsystem due to its standardization, high performance, and availability on leading platforms. In this paper, we explore the suitability of using MPI as a scalable PGAS communication subsystem. We focus on the Remote Memory Access (RMA) communication in PGAS models which typically includes get, put, and atomic memory operations. We perform an in-depth exploration of design alternatives based on MPI. These alternatives include using a semantically-matching interface such as MPI-RMA, as well as not-so-intuitive interfaces such as MPI two-sided with a combination of multi-threading and dynamic process management. With an in-depth exploration of these alternatives and their shortcomings, we propose a novel design which is facilitated by the data-centric view in PGAS models. This design leverages a combination of highly tuned MPI two-sided semantics and an automatic, user-transparent split of MPI communicators to provide asynchronous progress. We implement the asynchronous progress ranks approach and other approaches within the Communication Runtime for Exascale which is a communication subsystem for Global Arrays. Our performance evaluation spans pure communication benchmarks, graph community detection and sparse matrix-vector multiplication kernels, and a computational chemistry application. 
The utility of our proposed PR-based approach is demonstrated by a 2.17x speedup on 1008 processors over the other MPI-based designs.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126354190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
P. Yébenes, J. Escudero-Sahuquillo, Crispín Gómez Requena, P. García, F. J. Alfaro, F. Quiles, J. Duato
{"title":"Combining HoL-blocking avoidance and differentiated services in high-speed interconnects","authors":"P. Yébenes, J. Escudero-Sahuquillo, Crispín Gómez Requena, P. García, F. J. Alfaro, F. Quiles, J. Duato","doi":"10.1109/HiPC.2014.7116874","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116874","url":null,"abstract":"Current high-performance platforms such as Datacenters or High-Performance Computing systems rely on high-speed interconnection networks able to cope with the ever-increasing communication requirements of modern applications. In particular, in high-performance systems that must offer differentiated services to applications which involve traffic prioritization, it is almost mandatory that the interconnection network provides some type of Quality-of-Service (QoS) and Congestion-Management mechanism in order to achieve the required network performance. Most current QoS and Congestion-Management mechanisms for high-speed interconnects are based on using the same kind of resources, but with different criteria, resulting in disjoint types of mechanisms. By contrast, we propose in this paper a novel, straightforward solution that leverages the resources already available in InfiniBand components (basically Service Levels and Virtual Lanes) to provide both QoS and Congestion Management at the same time. This proposal is called CHADS (Combined HoL-blocking Avoidance and Differentiated Services), and it could be applied to any network topology. 
From the results shown in this paper for networks configured with the novel, cost-efficient KNS hybrid topology, we can conclude that CHADS is more efficient than other schemes in reducing the interferences among packet flows that have the same or different priorities.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"391 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116664449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An improved recursive graph bipartitioning algorithm for well balanced domain decomposition","authors":"Astrid Casadei, P. Ramet, J. Roman","doi":"10.1109/HiPC.2014.7116878","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116878","url":null,"abstract":"In the context of hybrid sparse linear solvers based on domain decomposition and Schur complement approaches, getting a domain decomposition tool leading to a good balancing of both the internal node set size and the interface node set size for all the domains is a critical point for load balancing and efficiency issues in a parallel computation context. For this purpose, we revisit the original algorithm introduced by Lipton, Rose and Tarjan [1] in 1979 which performed the recursion for nested dissection in a particular manner. From this specific recursive strategy, we propose in this paper several variations of the existing algorithms in the multilevel Scotch partitioner that take into account these multiple criteria and we illustrate the improved results on a collection of graphs corresponding to finite element meshes used in numerical scientific applications.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122610144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GPU parallelization of the stochastic on-time arrival problem","authors":"Maleen Abeydeera, S. Samaranayake","doi":"10.1109/HiPC.2014.7116896","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116896","url":null,"abstract":"The Stochastic On-Time Arrival (SOTA) problem has recently been studied as an alternative to traditional shortest-path formulations in situations with hard deadlines. The goal is to find a routing strategy that maximizes the probability of reaching the destination within a pre-specified time budget, with the edge weights of the graph being random variables with arbitrary distributions. While this is a practically useful formulation for vehicle routing, the commercial deployment of such methods is not currently feasible due to the high computational complexity of existing solutions. We present a parallelization strategy for improving the computation times by multiple orders of magnitude compared to the single threaded CPU implementations, using a CUDA GPU implementation. A single order of magnitude is achieved via naive parallelization of the problem, and another order of magnitude via optimal utilization of the GPU resources. We also show that the runtime can be further reduced in certain cases using dynamic thread assignment and an edge clustering method for accelerating queries with a small time budget.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133435200","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DRIVE: Using implicit caching hints to achieve disk I/O reduction in virtualized environments","authors":"Sujesha Sudevalayam, Purushottam Kulkarni","doi":"10.1109/HiPC.2014.7116877","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116877","url":null,"abstract":"Co-hosting of virtualized applications results in similar content across multiple blocks on disk, which are fetched into memory (the host's page cache). Content similarity can be harnessed both to avoid duplicate disk I/O requests that fetch the same content repeatedly, as well as to prevent multiple occurrences of duplicate content in cache. Typically, caches store the most recently or frequently accessed blocks to reduce the number of disk read accesses. These caches are referenced by block number, and can not recognize content similarity across multiple blocks. Existing work in memory deduplication merges cache pages after multiple identical blocks have already been fetched from disk into cache, while existing work in I/O deduplication reserves a portion of the host-cache to be maintained as a content-aware cache. We propose a disk I/O reduction system for the virtualization environment that addresses the dual problems of duplicate I/O and duplicate content in the host-cache, without being invasive. We build a disk read-access optimization called DRIVE, that identifies content similarity across multiple blocks, and performs hint-based read I/O redirection to improve cache effectiveness, thus reducing the number of disk reads further. A metadata store is maintained based on the virtual machine's disk accesses and implicit caching hints are collected for future read I/O redirection. The read I/O redirection is performed from within the virtual block device in the virtualized system, to manipulate the entire host-cache as a content-deduplicated cache implicitly. 
Our trace-based evaluation using a custom simulator, reveals that DRIVE always performs equal to or better than the Vanilla system, achieving up to 20% better cache-hit ratios and reducing the number of disk reads by up to 80%. The results also indicate that our system is able to achieve up to 97% content deduplication in the host-cache.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123540867","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rohan Bhalla, Prathmesh Kallurkar, Nitin Gupta, S. Sarangi
{"title":"TriKon: A hypervisor aware manycore processor","authors":"Rohan Bhalla, Prathmesh Kallurkar, Nitin Gupta, S. Sarangi","doi":"10.1109/HiPC.2014.7116710","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116710","url":null,"abstract":"Virtualization is increasingly being deployed to run applications in a cloud computing environment. Sadly, there are overheads associated with hypervisors that can prohibitively reduce application performance. A major source of the overheads is the destructive interference between the application, OS, and hypervisor in the memory system. We characterize such overheads in this paper, and propose the design of a novel Triangle cache that can effectively mitigate destructive interference across these three classes of workloads. We subsequently proceed to design the TriKon manycore processor that consists of a set of heterogeneous cores with caches of different sizes, and Triangle caches. To maximize the throughput of the system as a whole, we propose a dynamic scheduling algorithm for scheduling a class of system and CPU intensive applications on the set of heterogeneous cores. The area of the TriKon processor is within 2% of a baseline processor, and with such a system, we could achieve a performance gain of 12% for a suite of benchmarks. Within this suite, the system intensive benchmarks show a performance gain of 20% while the performance of the compute intensive ones remains unaffected. 
Also, by allocating extra area for cores with sophisticated cache designs, we further improved the performance of the system intensive benchmarks to 30%.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129384944","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An early experience of regional ocean modelling on intel many integrated core architecture","authors":"Srikanth Yalavarthi, A. Kaginalkar","doi":"10.1109/HiPC.2014.7116907","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116907","url":null,"abstract":"Ocean modelling is an inherently complex phenomenon within the earth system framework which poses a challenge to the earth and computational scientists. Simulation of wide temporal and spatial scales in real-time / near real-time necessitates computational scientists to explore new performance enhancing architectures and simulation methods. Due to the spectra of the scales of motion, the computational requirements of ocean forecasting are relatively higher than that of numerical weather prediction. To some extent, the rapidly evolving computer technology provides solutions to this. In this paper, we present initial attempts in porting a high resolution regional ocean model on an Intel MIC based hybrid system. We discuss the challenges and issues to be addressed in achieving an efficient implementation of an ocean modelling system.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127050264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GpuTejas: A parallel simulator for GPU architectures","authors":"Geetika Malhotra, Seep Goel, S. Sarangi","doi":"10.1109/HiPC.2014.7116897","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116897","url":null,"abstract":"In this paper, we introduce a new Java-based parallel GPGPU simulator, GpuTejas. GpuTejas is a fast trace driven simulator, which uses relaxed synchronization, and non-blocking data structures to derive its speedups. Secondly, it introduces a novel scheduling and partitioning scheme for parallelizing a GPU simulator. We evaluate the performance of our simulator with a set of Rodinia benchmarks. We demonstrate a mean speedup of 17.33x with 64 threads over sequential execution, and a speedup of 429X over the widely used simulator GPGPU-Sim. We validated our timing and simulation model by comparing our results with a native system (NVIDIA Tesla M2070). As compared to the sequential version of GpuTejas, the parallel version has an error limited to <7.67% for our suite of benchmarks, which is similar to the numbers reported by competing parallel simulators.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"74 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127181559","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Software based ultrasound B-mode/beamforming optimization on GPU and its performance prediction","authors":"T. Phuong, Jeong-Gun Lee","doi":"10.1109/HiPC.2014.7116911","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116911","url":null,"abstract":"In the paper, we design and optimize an ultrasound B-mode imaging including a high-computationally demanding beamformer on a commercial GPU. For the performance optimization, we explore the design space spanned with the use of different memory types, instruction scheduling and thread mapping strategies, etc. Then, with the developed B-mode imaging code, we conduct performance evaluations on various GPUs having different architectural features (e.g., the number of cores and core frequency). Through the experiments on various different GPU devices, we search “performance-significant-factors” which are hardware features affecting B-mode imaging performance. Then, the analytical relationship between these GPU architectural design factors and the B-mode imaging performance is derived for our target application. At the commercial aspect of developing a product, we can select GPU architectures which are best suited for the ultrasound applications through the prediction model. In the future, using the predictions, it would also be possible to customize a “cost-minimal” GPU architecture which satisfies a given performance constraint. 
In addition, the prediction model can be used to dynamically control the activity of GPU components according to the temporal requirement of performance and power/energy consumptions in portable ultrasound diagnosis systems.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122528109","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A multilevel compressed sparse row format for efficient sparse computations on multicore processors","authors":"H. Kabir, J. Booth, P. Raghavan","doi":"10.1109/HiPC.2014.7116882","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116882","url":null,"abstract":"We seek to improve the performance of sparse matrix computations on multicore processors with non-uniform memory access (NUMA). Typical implementations use a bandwidth reducing ordering of the matrix to increase locality of accesses with a compressed storage format to store and operate only on the non-zero values. We propose a new multilevel storage format and a companion ordering scheme as an explicit adaptation to map to NUMA hierarchies. More specifically, we propose CSR-k, a multilevel form of the popular compressed sparse row (CSR) format for a multicore processor with k > 1 well-differentiated levels in the memory subsystem. Additionally, we develop Band-k, a modified form of a traditional bandwidth reduction scheme, to convert a matrix represented in CSR to our proposed CSR-k. We evaluate the performance of the widely-used and important sparse matrix-vector multiplication (SpMV) kernel using CSR-2 on Intel Westmere processors for a test suite of 12 large sparse matrices with row densities in the range 3 to 45. On 32 cores, on average across all matrices in the test suite, the execution time for SpMV with CSR-2 is less than 42% of the time taken by the state-of-the-art automatically tuned SpMV resulting in energy savings of approximately 56%. Additionally, on average, the parallel speed-up on 32 cores of the automatically tuned SpMV relative to its 1-core performance is 8.18 compared to a value of 19.71 for CSR-2. Our analysis indicates that the higher performance of SpMV with CSR-2 comes from achieving higher reuse of x in the shared L3 cache without incurring overheads from fill-in of original zeroes. 
Furthermore, the pre-processing costs of SpMV with CSR-2 can be amortized on average over 97 iterations of SpMV using CSR and are substantially lower than the 513 iterations required for the automatically tuned implementation. Based on these results, CSR-k appears to be a promising multilevel formulation of CSR for adapting sparse computations to multicore processors with NUMA memory hierarchies.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125857279","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}