ACM SIGPLAN Symposium on Scala: Latest Publications

Using HPX and LibGeoDecomp for scaling HPC applications on heterogeneous supercomputers
ACM SIGPLAN Symposium on Scala. Pub Date: 2013-11-17. DOI: 10.1145/2530268.2530269
T. Heller, Hartmut Kaiser, Andreas Schäfer, D. Fey
Abstract: With the general availability of PetaFLOP clusters and the advent of heterogeneous machines equipped with special accelerator cards such as the Xeon Phi [2], computer scientists face the difficult task of improving application scalability beyond what is possible with conventional techniques and programming models today. In addition, the need for highly adaptive runtime algorithms, and for applications handling highly inhomogeneous data, further impedes our ability to efficiently write code which performs and scales well.
In this paper we present the advantages of using HPX [19, 3, 29], a general-purpose parallel runtime system for applications of any scale, as a backend for LibGeoDecomp [25] to implement a three-dimensional N-body simulation with local interactions, and we compare scaling and performance results for this application under the HPX and MPI backends. LibGeoDecomp is a library for geometric decomposition codes built around a user-supplied simulation model, where the library handles the spatial and temporal loops and the data storage.
The presented results were acquired from various homogeneous and heterogeneous runs on up to 1024 nodes (16384 conventional cores) combined with up to 16 Xeon Phi accelerators (3856 hardware threads) on TACC's Stampede supercomputer [1]. In the configuration using the HPX backend, more than 0.35 PFLOPS were achieved, corresponding to a parallel application efficiency of around 79%. Our measurements demonstrate the advantage of the intrinsically asynchronous, message-driven programming model exposed by HPX, which enables better latency hiding, fine- to medium-grain parallelism, and constraint-based synchronization. HPX's uniform programming model simplifies writing highly parallel code for heterogeneous resources.
Citations: 37
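The future-based, constraint-driven execution style the abstract attributes to HPX (in C++, via `hpx::async` and `hpx::dataflow`) can be illustrated with a short Python stand-in. This sketch is not the HPX API: `dataflow` here is a hypothetical helper built on `concurrent.futures`, showing only the idea that independent work starts eagerly and dependent tasks fire once their input futures resolve.

```python
from concurrent.futures import ThreadPoolExecutor

def dataflow(pool, fn, *deps):
    # Run fn once every dependency future has resolved
    # (constraint-based synchronization, in the spirit of hpx::dataflow).
    return pool.submit(lambda: fn(*(d.result() for d in deps)))

def demo():
    with ThreadPoolExecutor(max_workers=4) as pool:
        a = pool.submit(lambda: 2)                    # independent work, starts immediately
        b = pool.submit(lambda: 3)
        c = dataflow(pool, lambda x, y: x + y, a, b)  # waits only on a and b
        d = dataflow(pool, lambda x: 10 * x, c)       # chained dependency
        return d.result()
```

Because synchronization is expressed per dependency rather than as a global barrier, unrelated work can overlap with communication latency, which is the latency-hiding effect the abstract describes.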
On scalability behaviour of Monte Carlo sparse approximate inverse for matrix computations
ACM SIGPLAN Symposium on Scala. Pub Date: 2013-11-17. DOI: 10.1145/2530268.2530274
J. Strassburg, V. Alexandrov
Abstract: This paper presents a Monte Carlo SPAI preconditioner. In contrast to the standard deterministic SPAI preconditioners that use the Frobenius norm, a Monte Carlo alternative is given that relies on Markov Chain Monte Carlo (MCMC) methods to compute a rough matrix inverse (MI). Monte Carlo methods enable a quick rough estimate of the non-zero elements of the inverse matrix with a given precision and certain probability. The advantages of this method are that the same approach applies to sparse and dense matrices, and that the complexity of the Monte Carlo matrix inversion is linear in the size of the matrix. The behaviour of the proposed algorithm is studied, its performance is investigated, and a comparison is made with the standard deterministic SPAI as well as with the optimized and parallel MSPAI version. Furthermore, Monte Carlo SPAI and MSPAI are used for solving systems of linear algebraic equations (SLAE) with BiCGSTAB, and the results are compared.
Citations: 8
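The MCMC rough-inverse idea can be illustrated with the classic Ulam-von Neumann random-walk estimator, given here as a simplified stand-in rather than the paper's actual algorithm: for A = I - B with a convergent Neumann series, each entry of A^{-1} = I + B + B^2 + ... is estimated by averaging importance-weighted random walks, with cost per sample independent of the rest of the matrix.

```python
import random

def mc_inverse_entry(B, i, j, walks=200_000, p_stop=0.5, seed=1):
    # Estimate (I - B)^{-1}[i][j] = sum_k (B^k)[i][j] by averaging
    # importance-weighted random walks (Ulam-von Neumann scheme).
    rng = random.Random(seed)
    n = len(B)
    total = 0.0
    for _ in range(walks):
        state, weight = i, 1.0
        if state == j:
            total += weight                      # k = 0 term of the series
        while rng.random() >= p_stop:            # survive a step with prob 1 - p_stop
            nxt = rng.randrange(n)               # uniform proposal over columns
            weight *= B[state][nxt] / ((1 - p_stop) / n)  # importance correction
            state = nxt
            if state == j:
                total += weight                  # contribution of the k-step term
    return total / walks
```

For B = [[0.1, 0.2], [0.2, 0.1]], the exact inverse of I - B is (1/0.77) * [[0.9, 0.2], [0.2, 0.9]], and the estimator lands within a few multiples of its standard error of each entry.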
Robust distributed orthogonalization based on randomized aggregation
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133177
W. Gansterer, Gerhard Niederbrucker, H. Straková, Stefan Schulze Grotthoff
Abstract: The construction of distributed algorithms for matrix computations built on top of distributed data aggregation algorithms with randomized communication schedules is investigated. For this purpose, a new aggregation algorithm for summing or averaging distributed values, the push-flow algorithm, is developed, which achieves superior resilience to node failures compared to existing aggregation methods. On a hypercube topology it asymptotically requires the same number of iterations as the optimal all-to-all reduction operation, and it scales well with the number of nodes. Orthogonalization is studied as a prototypical matrix computation task. A new fault-tolerant distributed orthogonalization method (rdmGS), which can produce accurate results even in the presence of node failures, is built on top of distributed data aggregation algorithms.
Citations: 10
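The randomized-aggregation primitive that push-flow builds on can be illustrated with plain push-sum gossip averaging (Kempe et al.); the sketch below is that simpler predecessor, not the paper's push-flow algorithm, and it simulates all nodes in one process: every node keeps a (sum, weight) pair, pushes half to a random peer each round, and its local ratio s/w converges to the global mean with no central coordinator.

```python
import random

def push_sum(values, rounds=100, seed=0):
    # Gossip averaging: each node keeps half of its (s, w) pair and pushes
    # the other half to a uniformly random peer each round.  Mass is
    # conserved, and every node's estimate s/w converges to the mean.
    rng = random.Random(seed)
    n = len(values)
    s = [float(v) for v in values]
    w = [1.0] * n
    for _ in range(rounds):
        new_s, new_w = [0.0] * n, [0.0] * n
        for i in range(n):
            peer = rng.randrange(n)
            new_s[i] += s[i] / 2
            new_w[i] += w[i] / 2
            new_s[peer] += s[i] / 2
            new_w[peer] += w[i] / 2
        s, w = new_s, new_w
    return [si / wi for si, wi in zip(s, w)]
```

The fault-resilience claim of push-flow concerns what happens when such messages are lost or nodes fail mid-protocol, which this idealized sketch does not model.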
On non-blocking collectives in 3D FFTs
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133180
R. S. Saksena
Abstract: With the inclusion of non-blocking global collective operations in the MPI 3.0 draft specification, many fundamental algorithms, such as those for performing three-dimensional (3D) FFTs, will be modified to take advantage of non-blocking collectives. Novel modifications to such fundamental algorithms will need to be suitable for incorporation into general-purpose FFT libraries routinely used by HPC application users. Here we present such a general-purpose algorithmic strategy for utilizing non-blocking collective communications in the calculation of a single parallel 3D FFT. In this scheme, the global collective communication is partitioned into blocking and non-blocking components such that overlap between communication and computation is obtained in the 3D FFT calculation. We present benchmarks of our scheme for overlapping computation and communication in the calculation of single-variable 3D FFTs on two different architectures: (a) HECToR, a Cray XE6 machine, and (b) a Fujitsu PRIMERGY Intel Westmere cluster with an InfiniBand interconnect.
Citations: 4
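The overlap pattern can be sketched with a background thread standing in for a non-blocking exchange (the actual MPI calls, e.g. `MPI_Ialltoall`, are omitted): while slab k is being transformed locally, the "exchange" that fetches slab k+1 is already in flight. The slab decomposition and the toy DFT kernel below are illustrative assumptions, not the paper's scheme.

```python
import cmath
from concurrent.futures import ThreadPoolExecutor

def dft(x):
    # O(n^2) reference DFT; a stand-in for a tuned 1-D FFT kernel.
    n = len(x)
    return [sum(x[k] * cmath.exp(-2j * cmath.pi * j * k / n) for k in range(n))
            for j in range(n)]

def pipelined_slab_transform(slabs):
    # Pipelining: complete exchange k, immediately start exchange k+1,
    # then compute on slab k while exchange k+1 proceeds in background.
    with ThreadPoolExecutor(max_workers=1) as comm:
        out = []
        inflight = comm.submit(list, slabs[0])              # first "exchange"
        for k in range(len(slabs)):
            data = inflight.result()                        # wait for exchange k
            if k + 1 < len(slabs):
                inflight = comm.submit(list, slabs[k + 1])  # exchange k+1 in flight
            out.append(dft(data))                           # compute overlaps comm
        return out
```

The pipelined result is identical to transforming each slab sequentially; the benefit is purely in hiding communication time behind the per-slab compute.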
The low-power architecture approach towards exascale computing
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133175
Nikola Rajovic, Nikola Puzovic, L. Vilanova, Carlos Villavieja, Alex Ramírez
Abstract: Energy efficiency is a first-order concern when deploying any computer system. From battery-operated mobile devices to data centers and supercomputers, energy consumption limits the performance that can be offered.
We are exploring an alternative to current supercomputers that builds on small, energy-efficient mobile processors. We present results from a prototype system based on the ARM Cortex-A9 and make projections about the possibilities for increasing energy efficiency.
Citations: 82
Soft error resilient QR factorization for hybrid system with GPGPU
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133179
Peng Du, P. Luszczek, S. Tomov, J. Dongarra
Abstract: General-purpose graphics processing units (GPGPUs) are increasingly deployed for scientific computing due to their performance advantages over CPUs. As a result, fault tolerance has become a more serious concern than in the period when GPGPUs were used exclusively for graphics applications. Using GPUs and CPUs together in a hybrid computing system increases flexibility and performance, but also increases the possibility of the computations being affected by soft errors. In this work, we propose a soft error resilient algorithm for QR factorization on such hybrid systems. Our contributions include (1) a checkpointing and recovery mechanism for the left factor Q whose performance is scalable on hybrid systems; (2) optimized Givens rotation utilities on GPGPUs to efficiently reduce an upper Hessenberg matrix to upper triangular form for the protection of the right factor R; and (3) a recovery algorithm based on QR update on GPGPUs. Experimental results show that our fault tolerant QR factorization can successfully detect and recover from soft errors in the entire matrix with little overhead on hybrid systems with GPGPUs.
Citations: 36
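The Hessenberg-to-triangular reduction named in contribution (2) takes one Givens rotation per subdiagonal entry. A minimal CPU-side sketch of that reduction follows (the GPGPU batching is the paper's contribution and is not shown here):

```python
import math

def hessenberg_qr(H):
    # Reduce an upper Hessenberg matrix to the upper triangular R of its
    # QR factorization: one Givens rotation per subdiagonal entry.
    n = len(H)
    R = [row[:] for row in H]
    for k in range(n - 1):
        a, b = R[k][k], R[k + 1][k]
        r = math.hypot(a, b)
        if r == 0.0:
            continue
        c, s = a / r, b / r               # rotation zeroing b below a
        for j in range(k, n):             # apply to rows k and k+1
            rk, rk1 = R[k][j], R[k + 1][j]
            R[k][j] = c * rk + s * rk1
            R[k + 1][j] = -s * rk + c * rk1
        R[k + 1][k] = 0.0                 # exact zero by construction
    return R
```

Because a Hessenberg matrix has only one nonzero below each diagonal entry, this costs O(n^2) rather than the O(n^3) of a full QR, which is why it is cheap enough to use for protecting R.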
Performance analysis of a cardiac simulation code using IPM
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133186
P. Strazdins, M. Hegland
Abstract: This paper details our experiences in performing a detailed performance analysis of a large-scale parallel cardiac simulation using the Chaste software on a Nehalem- and InfiniBand-based cluster. Our methodology achieves good accuracy for relatively modest amounts of cluster time. The use of sections in the Chaste internal profiler, coupled with the IPM tool, enabled detailed insights into the performance and scalability of the application.
For large core counts, our analysis showed that performance was no longer dominated by the linear systems solver. The computationally intensive components scaled well up to 2048 cores, while poorly scaling and highly imbalanced components associated with program output and miscellaneous functions limited scalability.
Citations: 3
Fault tolerant matrix-matrix multiplication: correcting soft errors on-line
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133185
Panruo Wu, Chong Ding, Longxiang Chen, Feng Gao, T. Davies, Christer Karlsson, Zizhong Chen
Abstract: Soft errors are one-time events that corrupt the state of a computing system but not its overall functionality. Soft errors normally do not interrupt the execution of the affected program, but the affected computation results can no longer be trusted. A well-known technique to correct soft errors in matrix-matrix multiplication is algorithm-based fault tolerance (ABFT). While ABFT achieves much better efficiency than triple modular redundancy (TMR), a traditional general technique for correcting soft errors, both ABFT and TMR detect errors off-line, after the computation is finished. This paper extends the traditional ABFT technique from off-line to on-line, so that soft errors in matrix-matrix multiplication can be detected in the middle of the computation during program execution, and higher efficiency can be achieved by correcting the corrupted computations in a timely manner. Experimental results demonstrate that the proposed technique can correct one error every ten seconds with negligible (i.e., less than 1%) performance penalty over the ATLAS dgemm().
Citations: 35
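The off-line ABFT baseline the paper extends fits in a few lines (the classic Huang-Abraham checksum encoding; the paper's on-line detection inside the running computation goes beyond this sketch): append a column-sum row to A and a row-sum column to B, and the product of the encoded factors then carries both checksums of C, so a single corrupted entry is located by its intersecting failed row and column tests and repaired.

```python
def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def encode(A, B):
    # Af gains a column-sum checksum row; Bf gains a row-sum checksum column.
    Af = [row[:] for row in A] + [[sum(col) for col in zip(*A)]]
    Bf = [row[:] + [sum(row)] for row in B]
    return Af, Bf

def correct_single_error(Cf, tol=1e-9):
    # Locate a single bad entry of C = Af*Bf (minus its checksum row/column)
    # by the failing row and column checksum tests, then repair it.
    n, p = len(Cf) - 1, len(Cf[0]) - 1
    bad_row = next((i for i in range(n)
                    if abs(sum(Cf[i][:p]) - Cf[i][p]) > tol), None)
    bad_col = next((j for j in range(p)
                    if abs(sum(Cf[i][j] for i in range(n)) - Cf[n][j]) > tol), None)
    if bad_row is not None and bad_col is not None:
        Cf[bad_row][bad_col] += Cf[bad_row][p] - sum(Cf[bad_row][:p])
    return [row[:p] for row in Cf[:n]]
```

The encoding adds only O(n) work per row and column, which is why ABFT is so much cheaper than running the multiplication three times as TMR does.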
Layout-aware scientific computing: a case study using MILC
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133183
Jun He, J. Kowalkowski, M. Paterno, D. Holmgren, J. Simone, Xian-He Sun
Abstract: Nowadays, high performance computers have more cores and nodes than ever before. Computation is spread out among them, leading to more communication. For this reason, communication can easily become the bottleneck of a system and limit its scalability. The layout of an application on a computer is the key factor in preserving communication locality and reducing its cost. In this paper, we propose a simple model to optimize the layout for scientific applications by minimizing inter-node communication cost. The model takes into account the latency and bandwidth of the network and associates them with the dominant layout variables of the application. We take MILC as an example and analyze its communication patterns. According to our experimental results, the model developed for MILC achieved satisfactory accuracy in predicting performance, leading to up to 31% performance improvement.
Citations: 7
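The kind of latency/bandwidth model the abstract describes can be sketched as a standard alpha-beta cost tied to layout variables. The 2-D halo-exchange example and all constants below are illustrative assumptions, not MILC's actual (4-D lattice) communication pattern:

```python
def comm_cost(messages, nbytes, latency, bandwidth):
    # Alpha-beta model: time = latency * #messages + bytes / bandwidth.
    return latency * messages + nbytes / bandwidth

def halo_cost(px, py, nx, ny, latency, bandwidth, word=8):
    # Per-process, per-step cost of a 2-D halo exchange for an nx x ny grid
    # decomposed over a px x py process mesh (two faces per split dimension).
    msgs = 2 * (px > 1) + 2 * (py > 1)
    nbytes = word * (2 * (ny // py) * (px > 1) + 2 * (nx // px) * (py > 1))
    return comm_cost(msgs, nbytes, latency, bandwidth)
```

For 16 processes on a 4096x4096 grid with 1 microsecond latency and 1 GB/s bandwidth, the model prefers a 4x4 block layout over a 16x1 strip: the halved halo surface outweighs the two extra messages, which is exactly the surface-vs-latency trade-off a layout optimizer explores.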
Top down programming methodology and tools with StarSs - enabling scalable programming paradigms: extended abstract
ACM SIGPLAN Symposium on Scala. Pub Date: 2011-11-14. DOI: 10.1145/2133173.2133182
Rosa M. Badia
Abstract: Current supercomputers are evolving into clusters with a very large number of nodes, and moreover, the nodes themselves are becoming more complex, composed of several multicore chips and GPUs. With such architectures, application developers face an increasingly complex task. On the other hand, most HPC applications are scientific legacy codes written in MPI and designed for at most thousands of processors. Current efforts deal with extending these applications to scale to larger numbers of cores and to be combined with CUDA or OpenCL to run efficiently on GPUs.
To evolve a given application to run on new heterogeneous supercomputers, application developers can take different alternatives: optimizations to relieve MPI bottlenecks, for example by using asynchronous communications; optimizations of the sequential code to improve its locality; or optimizations at the node level to avoid resource contention, to list a few.
This paper proposes a methodology to enable current MPI applications to be improved using the MPI/StarSs programming model. StarSs [2] is a task-based programming model that enables the parallelization of sequential applications by annotating the code with compiler directives. More importantly, it supports their execution on heterogeneous platforms, including clusters of GPUs. It also hybridizes nicely with MPI [1] and enables the overlap of communication and computation.
The approach is based on the generation at execution time of a directed acyclic graph (DAG), where the nodes of the graph denote tasks in the application and the edges denote data dependences between tasks. Once a partial DAG has been generated, the StarSs runtime is able to schedule the tasks to the different cores or GPUs of the platform.
Another relevant aspect is that the programming model offers application developers a single name space while the actual memory addresses can be distributed (as in a cluster or a node with GPUs). The StarSs runtime maintains a hierarchical directory with information about where to find each block of data, and different software caches are maintained in each of the distributed memory spaces. The runtime is responsible for transferring data between the different memory spaces and for keeping them coherent.
While the programming model itself comes with a very simple syntax, identifying tasks may sometimes not be as easy as one might predict, especially when trying to taskify MPI applications. With the purpose of simplifying this process, a set of tools has been developed around the framework: Ssgrind, which helps identify tasks and the directionality of the tasks' parameters; Ayudame and Temanejo, which help debug StarSs applications; and Paraver, Cube and Scalasca, which enable a detailed performance analysis of the applications. The extended version of the paper will detail the programming methodology outlined here, illustrating it with examples.
Citations: 4
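The annotation-to-DAG idea can be sketched as a toy runtime: tasks declare the named data they read and write, true (read-after-write) dependences become DAG edges, and a task launches once the futures it depends on have resolved. This Python sketch only mimics the style; it is not the StarSs compiler or runtime, and it handles only true dependences (no renaming for anti- or output dependences, no distributed memory).

```python
from concurrent.futures import ThreadPoolExecutor

class TaskGraph:
    # Toy StarSs-style runtime: DAG edges are inferred from the read/write
    # sets each task declares for named data blocks.
    def __init__(self, workers=4):
        self.pool = ThreadPoolExecutor(max_workers=workers)
        self.writer = {}                     # data name -> future of last writer

    def task(self, fn, reads=(), writes=()):
        deps = [self.writer[r] for r in reads if r in self.writer]
        def run():
            for d in deps:                   # block until every input is produced
                d.result()
            return fn()
        fut = self.pool.submit(run)
        for name in writes:
            self.writer[name] = fut          # later readers depend on this task
        return fut

def demo_dag():
    store, g = {}, TaskGraph()
    g.task(lambda: store.update(a=1), writes=["a"])   # independent producers
    g.task(lambda: store.update(b=2), writes=["b"])
    last = g.task(lambda: store.update(c=store["a"] + store["b"]),
                  reads=["a", "b"], writes=["c"])     # consumer waits on both
    last.result()
    g.pool.shutdown()
    return store["c"]
```

The two producer tasks run concurrently and the consumer fires only when both have finished, which is the partial-DAG scheduling behaviour the abstract describes.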