2014 5th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems: Latest Publications

A Hierarchical Tridiagonal System Solver for Heterogenous Supercomputers
Xinliang Wang, Yangtong Xu, Wei Xue
DOI: 10.1109/ScalA.2014.12 | Published: 2014-11-16
Abstract: The tridiagonal system solver is an important kernel in many scientific and engineering applications. Even though quite a few parallel algorithms and implementations have been proposed in recent years, challenges remain when solving large-scale tridiagonal systems on heterogeneous supercomputers. In this paper, a hierarchical algorithm framework, SPIKE² (pronounced "SPIKE squared"), is proposed to minimize parallel overhead and achieve the best utilization of CPU-GPU hybrid systems. A layered and adaptive partitioning based on the SPIKE algorithm effectively controls the sequential parts while efficiently exploiting computation-communication overlap within a heterogeneous computing node. Moreover, the SPIKE algorithm is reformulated to reduce the matrix computations to only 1/3 in our hierarchical framework. Meanwhile, an improved implementation of the tiled-PCR-pThomas algorithm is employed for the GPU architecture, and the shared memory usage on the GPU is reduced by 1/3 through careful dependence analysis on solving unit-vector tridiagonal systems. Our experiments on Tianhe-1A show ideal weak scalability on up to 128 nodes when solving a tridiagonal system of size 1920M in the largest run, and good strong scalability (70%) from 32 to 256 nodes when solving a system of size 480M. Furthermore, adaptive task partitioning across the CPU and GPU yields over 10% performance improvement in the strong-scaling test on 256 nodes. On one computing node of Tianhe-1A, our GPU-only code outperforms the CUSPARSE version (non-pivoting tridiagonal solver) by 30%, and our hybrid code is about 6.7 times faster than the Intel SPIKE multi-process version for tridiagonal systems of size 3M, 5M, and 15M.
Citations: 8
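The paper's GPU kernel combines tiled parallel cyclic reduction (PCR) with the Thomas algorithm. As a point of reference only, here is a minimal NumPy sketch of the sequential Thomas algorithm that such hybrid kernels partition and parallelize; the function name and test setup are illustrative, not from the paper.

```python
import numpy as np

def thomas_solve(a, b, c, d):
    """Solve a tridiagonal system Ax = d, where a is the sub-diagonal
    (a[0] unused), b the main diagonal, and c the super-diagonal
    (c[-1] unused). Sequential O(n): forward elimination + back substitution."""
    n = len(b)
    cp = np.empty(n)
    dp = np.empty(n)
    cp[0] = c[0] / b[0]
    dp[0] = d[0] / b[0]
    for i in range(1, n):
        m = b[i] - a[i] * cp[i - 1]          # pivot after eliminating a[i]
        cp[i] = c[i] / m if i < n - 1 else 0.0
        dp[i] = (d[i] - a[i] * dp[i - 1]) / m
    x = np.empty(n)
    x[-1] = dp[-1]
    for i in range(n - 2, -1, -1):           # back substitution
        x[i] = dp[i] - cp[i] * x[i + 1]
    return x

# quick check against a dense solve on a diagonally dominant system
n = 8
rng = np.random.default_rng(0)
a, b, c = rng.random(n), rng.random(n) + 2.0, rng.random(n)
A = np.diag(b) + np.diag(a[1:], -1) + np.diag(c[:-1], 1)
d = rng.random(n)
assert np.allclose(thomas_solve(a, b, c, d), np.linalg.solve(A, d))
```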
VCube: A Provably Scalable Distributed Diagnosis Algorithm
E. P. Duarte, L. C. E. Bona, Vinicius K. Ruoso
DOI: 10.1109/ScalA.2014.14 | Published: 2014-11-16
Abstract: VCube is a distributed diagnosis algorithm for virtually interconnecting network nodes. VCube presents several logarithmic properties and forms a logical hypercube when all nodes are fault-free. VCube is dynamic in the sense that nodes can leave and rejoin the system as they become faulty and are repaired. The topology reorganizes itself and keeps its logarithmic properties even if an arbitrary number of nodes are faulty. Fault diagnosis is based on tests: all fault-free nodes of a system with N nodes detect an event with a latency of at most log₂² N testing rounds. In this work we specify the algorithm and show that the worst-case number of tests executed is N log₂ N per log₂ N rounds. Besides the correctness proofs, experimental results are also given.
Citations: 28
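As a toy illustration of hypercube-style test assignment (a simplification, not the paper's actual cluster function c(i, s)), the sketch below has node i test its hypercube neighbor i XOR 2^s in round s, which is where the log₂ N rounds and logarithmic test counts come from.

```python
def hypercube_test_schedule(n_log2):
    """Toy hypercube test assignment: in round s, node i tests neighbor
    i XOR 2**s. Simplified stand-in for VCube's c(i, s) clusters, shown
    only to illustrate the logarithmic structure."""
    n = 1 << n_log2
    schedule = []
    for s in range(n_log2):  # log2(N) testing rounds
        schedule.append([(i, i ^ (1 << s)) for i in range(n)])
    return schedule

# With N = 8 nodes, each node performs log2(8) = 3 tests across 3 rounds.
for s, tests in enumerate(hypercube_test_schedule(3)):
    print(f"round {s}: {tests}")
```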
CholeskyQR2: A Simple and Communication-Avoiding Algorithm for Computing a Tall-Skinny QR Factorization on a Large-Scale Parallel System
Takeshi Fukaya, Y. Nakatsukasa, Yuka Yanagisawa, Yusaku Yamamoto
DOI: 10.1109/ScalA.2014.11 | Published: 2014-11-16
Abstract: Designing communication-avoiding algorithms is crucial for high-performance computing on a large-scale parallel system. The TSQR algorithm is a communication-avoiding algorithm for computing a tall-skinny QR factorization; it is known to be much faster than, and as stable as, the classical Householder QR algorithm. The Cholesky QR algorithm is another very simple and fast communication-avoiding algorithm, but it is rarely used in practice because of its numerical instability. Our recent work points out that an algorithm that simply repeats Cholesky QR twice, which we call CholeskyQR2, gives excellent accuracy for a wide range of matrices arising in practice. Although the communication cost of CholeskyQR2 is twice that of TSQR, it has the advantage that its reduction operation is addition, whereas that of TSQR is a QR factorization, whose high-performance implementation is more difficult. Thus, CholeskyQR2 can potentially be significantly faster than TSQR. Indeed, in our experiments using 16384 nodes of the K computer, CholeskyQR2 ran about three times faster than TSQR for a 4194304 × 64 matrix.
Citations: 31
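The algorithm is short enough to state directly: take the Cholesky factor R of the Gram matrix AᵀA, form Q = AR⁻¹, and repeat once to restore orthogonality. A minimal single-node NumPy sketch follows; in the paper's distributed setting, the Gram-matrix computation is the addition-based reduction performed across nodes.

```python
import numpy as np

def cholesky_qr(A):
    """One Cholesky QR pass: A = QR with R upper triangular."""
    G = A.T @ A                      # Gram matrix (the only reduction step)
    R = np.linalg.cholesky(G).T      # G = R^T R, R upper triangular
    Q = np.linalg.solve(R.T, A.T).T  # Q = A R^{-1} without forming an inverse
    return Q, R

def cholesky_qr2(A):
    """CholeskyQR2: repeating Cholesky QR twice restores orthogonality."""
    Q1, R1 = cholesky_qr(A)
    Q, R2 = cholesky_qr(Q1)
    return Q, R2 @ R1                # A = Q (R2 R1)

A = np.random.default_rng(1).standard_normal((100000, 64))
Q, R = cholesky_qr2(A)
print(np.linalg.norm(Q.T @ Q - np.eye(64)))             # ~ machine epsilon
print(np.linalg.norm(Q @ R - A) / np.linalg.norm(A))    # small residual
```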
TX: Algorithmic Energy Saving for Distributed Dense Matrix Factorizations
Li Tan, Zizhong Chen
DOI: 10.1109/ScalA.2014.7 | Published: 2014-11-16
Abstract: The pressing demand to improve the energy efficiency of high-performance scientific computing has motivated a large body of solutions using Dynamic Voltage and Frequency Scaling (DVFS) that strategically switch processors to low-power states when peak processor performance is unnecessary. Although OS-level solutions have demonstrated effective energy savings in a black-box fashion, for applications with variable execution patterns, optimal energy efficiency can be squandered due to defective prediction mechanisms and untapped load imbalance. In this paper, we propose TX, a library-level race-to-halt DVFS scheduling approach that analyzes the task dependency set of each task in distributed Cholesky/LU/QR factorizations to achieve substantial energy savings that OS-level solutions cannot. While partially giving up the generality of OS-level solutions by requiring library-level source modification, TX leverages algorithmic characteristics of the applications to gain greater energy savings. Experimental results on two clusters indicate that TX can save up to 17.8% more energy than state-of-the-art OS-level solutions, with a negligible 3.5% average performance loss.
Citations: 1
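A race-to-halt policy is simple to state: run each task at peak frequency, then drop the core to a low-power state while it waits for the dependencies of its next task. The toy single-core scheduler below illustrates that idea; the task model, names, and frequency constants are invented for illustration and are not the TX library's interface.

```python
PEAK_HZ, IDLE_HZ = 2.6e9, 1.2e9  # illustrative frequencies only

def race_to_halt(tasks, deps, cost):
    """Toy race-to-halt schedule for one core.
    tasks: task names in topological order; deps: task -> prerequisite set;
    cost: task -> seconds at peak frequency. Returns finish times and the
    total time the core could sit halted at a low-power state."""
    finish = {}
    busy_until = 0.0
    halted = 0.0
    for t in tasks:
        ready = max((finish[d] for d in deps[t]), default=0.0)
        start = max(ready, busy_until)
        halted += max(0.0, ready - busy_until)  # slack spent at IDLE_HZ
        finish[t] = start + cost[t]             # task "raced" at PEAK_HZ
        busy_until = finish[t]
    return finish, halted

# hypothetical fragment of a factorization task graph
tasks = ["panel", "update1", "update2"]
deps = {"panel": set(), "update1": {"panel"}, "update2": {"panel"}}
cost = {"panel": 1.0, "update1": 0.5, "update2": 0.5}
print(race_to_halt(tasks, deps, cost))
```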
Performance and Portability with OpenCL for Throughput-Oriented HPC Workloads across Accelerators, Coprocessors, and Multicore Processors
Chongxiao Cao, M. Gates, A. Haidar, P. Luszczek, S. Tomov, I. Yamazaki, J. Dongarra
DOI: 10.1109/ScalA.2014.8 | Published: 2014-11-16
Abstract: Ever since accelerators and coprocessors became the mainstream hardware for throughput-oriented HPC workloads, various programming techniques have been proposed to increase productivity in terms of both performance and ease of use. We evaluate these aspects of OpenCL on a number of hardware platforms for an important subset of dense linear algebra operations that are relevant to a wide range of scientific applications. Our findings indicate that OpenCL portability has improved since our previous publication, and many new and surprising usage scenarios are possible that rival those available after decades of software development on CPUs. The combined performance-portability metric, even though not promised by the OpenCL standard, reflects the need for tuning performance-critical operations during the porting process, and we show how a large portion of the available efficiency is lost if the tuning is not done correctly.
Citations: 13
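The portability the paper evaluates can be seen in miniature with PyOpenCL: the same kernel source runs unmodified on every visible device, whether GPU, coprocessor, or CPU. This sketch shows functional portability only; the paper's point is that performance portability additionally requires per-device tuning (work-group sizes, blocking), which is deliberately omitted here.

```python
import numpy as np
import pyopencl as cl

src = """
__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *out) {
    int i = get_global_id(0);
    out[i] = a[i] + b[i];
}
"""

a = np.arange(1 << 20, dtype=np.float32)
b = np.ones_like(a)

for platform in cl.get_platforms():
    for device in platform.get_devices():     # GPU, accelerator, or CPU
        ctx = cl.Context([device])
        queue = cl.CommandQueue(ctx)
        mf = cl.mem_flags
        buf_a = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=a)
        buf_b = cl.Buffer(ctx, mf.READ_ONLY | mf.COPY_HOST_PTR, hostbuf=b)
        buf_o = cl.Buffer(ctx, mf.WRITE_ONLY, a.nbytes)
        prg = cl.Program(ctx, src).build()    # same source for every device
        prg.vadd(queue, a.shape, None, buf_a, buf_b, buf_o)
        out = np.empty_like(a)
        cl.enqueue_copy(queue, out, buf_o)
        print(device.name, np.allclose(out, a + b))
```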
Deflation Strategies to Improve the Convergence of Communication-Avoiding GMRES
I. Yamazaki, S. Tomov, J. Dongarra
DOI: 10.1109/ScalA.2014.6 | Published: 2014-11-16
Abstract: The generalized minimum residual (GMRES) method is a popular method for solving large-scale sparse nonsymmetric linear systems of equations. On modern computers, especially large-scale systems, communication is becoming increasingly expensive. To address this hardware trend, a communication-avoiding variant of GMRES (CA-GMRES) has become attractive, frequently showing superior performance over GMRES on various hardware architectures. In practice, to mitigate the increasing cost of explicitly orthogonalizing the projection basis vectors, the iterations of both GMRES and CA-GMRES are restarted, which often slows down the solution convergence. To avoid this slowdown and improve the performance of restarted CA-GMRES, in this paper we study the effectiveness of deflation strategies. Our studies are based on a thick-restarted variant of CA-GMRES, which implicitly deflates a few Ritz vectors that approximately span an eigenspace of the coefficient matrix through the standard orthogonalization process. This strategy is mathematically equivalent to standard thick-restarted GMRES, requires only a small computational overhead, and does not increase the communication or storage costs of CA-GMRES. Hence, by avoiding communication, this deflated version of CA-GMRES obtains the same performance benefits over the deflated version of GMRES as standard CA-GMRES does over GMRES. Our experimental results on a hybrid CPU/GPU cluster demonstrate that thick restarting can significantly improve the convergence and reduce the solution time of CA-GMRES. We also show that this deflation strategy can be combined with a local domain-decomposition-based preconditioner to further enhance the robustness of CA-GMRES, making it more attractive in practice.
Citations: 4
Scaling Parallel 3-D FFT with Non-Blocking MPI Collectives
Sukhyun Song, J. Hollingsworth
DOI: 10.1109/ScalA.2014.9 | Published: 2014-11-16
Abstract: This paper describes a new method for scalable, high-performance parallel 3-D FFT. We use a 2-D decomposition of 3-D arrays to increase scaling to a large number of cores. To achieve high performance, we use non-blocking MPI all-to-all operations and exploit computation-communication overlap. We also auto-tune our 3-D FFT code efficiently over a large parameter space and cope with the complex trade-offs of optimizing our code in various system environments. According to experimental results with up to 32,768 cores, our method computes parallel 3-D FFT up to 1.83× faster than the FFTW library.
Citations: 6
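The core overlap idea is that the non-blocking all-to-all of one data chunk progresses while the 1-D FFTs of the next chunk are computed. A skeletal mpi4py illustration of that pipeline follows (chunk counts and sizes are arbitrary, and the real code's transpose layout and tuning are omitted; the chunk length must be divisible by the number of ranks for the all-to-all).

```python
# run as: mpirun -n P python fft_overlap.py
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD

# Each rank owns a slab, processed in chunks so that the exchange of chunk
# k-1 (Ialltoall) overlaps the 1-D FFTs of chunk k.
n_chunks, chunk = 4, 1 << 16          # chunk % comm.Get_size() == 0 assumed
local = (np.random.standard_normal((n_chunks, chunk))
         .astype(np.complex128))
send = np.empty_like(local)           # keeps in-flight buffers alive
recv = np.empty_like(local)

req = None
for k in range(n_chunks):
    send[k] = np.fft.fft(local[k])    # compute: 1-D FFTs of chunk k
    if req is not None:
        req.Wait()                    # finish exchanging chunk k-1
    req = comm.Ialltoall(send[k], recv[k])  # exchange chunk k in background
req.Wait()                            # drain the final exchange
```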
A Framework for Parallel Genetic Algorithms for Distributed Memory Architectures
Dobromir Georgiev, E. Atanassov, V. Alexandrov
DOI: 10.1109/ScalA.2014.13 | Published: 2014-11-16
Abstract: Genetic algorithms are metaheuristic search methods based on the principles of biological evolution and genetics. Through a heuristic search they are able to find good solutions in acceptable time. However, as the complexity of the fitness landscape and the size of the search space grow, their runtime increases rapidly. Using parallel implementations of genetic algorithms to harness the power of modern computational platforms is a powerful approach to mitigating this issue. In this paper, several parallel implementations are developed, ranging from pure MPI to hybrid MPI/OpenMP and MPI/OmpSs, optimized for execution on tightly coupled distributed memory systems. We address issues that arise when running a distributed genetic algorithm and present an adaptive migration scheme. A comparison of their efficiency is also presented.
Citations: 3
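A distributed GA of this kind typically follows the island model: each MPI rank evolves its own subpopulation and periodically migrates good individuals to neighbors. The mpi4py sketch below shows that pattern with a fixed migration interval and a toy fitness function; both are illustrative stand-ins (the paper's scheme adapts the migration).

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

def fitness(pop):                 # toy objective: minimize the sphere function
    return -(pop ** 2).sum(axis=1)

rng = np.random.default_rng(rank)
pop = rng.uniform(-5, 5, size=(64, 10))      # one island per MPI rank

for gen in range(200):
    f = fitness(pop)
    parents = pop[np.argsort(f)[-32:]]       # truncation selection: top half
    pop = (parents[rng.integers(0, 32, 64)]  # clone parents ...
           + rng.normal(0.0, 0.1, (64, 10))) # ... with Gaussian mutation
    if gen % 20 == 0 and size > 1:           # migrate every 20 generations
        best = pop[np.argmax(fitness(pop))].copy()
        right, left = (rank + 1) % size, (rank - 1) % size
        incoming = np.empty(10)
        comm.Sendrecv(best, dest=right, recvbuf=incoming, source=left)
        pop[np.argmin(fitness(pop))] = incoming  # replace worst individual

print(rank, fitness(pop).max())
```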
The Anatomy of Mr. Scan: A Dissection of Performance of an Extreme Scale GPU-Based Clustering Algorithm
Benjamin Welton, B. Miller
DOI: 10.1109/ScalA.2014.10 | Published: 2014-11-16
Abstract: The emergence of leadership-class systems with GPU-equipped nodes has the potential to vastly increase the performance of existing distributed applications. However, incorporating GPU computation into an existing extreme-scale distributed application can reveal scalability issues that were absent in the CPU version, and these issues can become limiting factors to overall application performance. We developed an extreme-scale GPU-based application, called Mr. Scan, that performs data clustering on multi-billion-point datasets, and in it we ran into several of these performance-limiting issues. Through complete end-to-end benchmarking of Mr. Scan (measuring time from reading and distribution to final output), we identified three major sources of real-world performance issues: data distribution, GPU load balancing, and system-specific issues such as start-up time. These issues accounted for the vast majority of Mr. Scan's run time; data distribution alone accounted for 68% of the total when processing 6.5 billion points on Cray Titan at 8192 nodes. With improvements in these areas, we have been able to cut the total run time of Mr. Scan from 17.5 minutes to 8.3 minutes when clustering 6.5 billion points.
Citations: 8
Exploiting Data Representation for Fault Tolerance
James Elliott, M. Hoemmen, F. Mueller
DOI: 10.1109/ScalA.2014.5 | Published: 2013-12-09
Abstract: We explore the link between data representation and soft errors in dot products. We present an analytic model for the absolute error introduced should a soft error corrupt a bit in an IEEE-754 floating-point number. We show how this finding relates to the fundamental linear algebra concepts of normalization and matrix equilibration. We present a case study illustrating that the probability of experiencing a large error in a dot product is minimized when both vectors are normalized. Furthermore, when data is normalized, we show that the absolute error is either less than one or very large, which allows us to detect large errors. We demonstrate how this finding can be used by instrumenting the GMRES iterative solver: we count all possible errors that can be introduced through faults in arithmetic in the computationally intensive orthogonalization phase, and show that when scaling is used, the absolute error can be bounded above by one.
Citations: 25
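The underlying observation is easy to reproduce: flipping a mantissa or low exponent bit of a normalized value (|x| ≤ 1) yields an absolute error below one, while flipping a high exponent bit yields an error so large it is trivially detectable. A small experiment in that spirit (not the paper's full analytic model; bit positions chosen for illustration):

```python
import struct

def flip_bit(x, k):
    """Flip bit k (0 = least significant) of a 64-bit IEEE-754 double."""
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    (y,) = struct.unpack("<d", struct.pack("<Q", bits ^ (1 << k)))
    return y

x = 0.75                        # a normalized value, |x| <= 1
for k in (0, 30, 52, 58, 62):   # mantissa bits, then low/high exponent bits
    y = flip_bit(x, k)
    print(f"bit {k:2d}: corrupted={y!r:24} abs err={abs(y - x):.3e}")
# Mantissa and low-exponent flips give errors < 1; flipping bit 62 (a high
# exponent bit) blows the value up by hundreds of orders of magnitude.
```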