An improved parallel singular value algorithm and its implementation for multicore hardware

2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC) Pub Date : 2013-11-17 DOI:10.1145/2503210.2503292

A. Haidar, J. Kurzak, P. Luszczek

{"title":"An improved parallel singular value algorithm and its implementation for multicore hardware","authors":"A. Haidar, J. Kurzak, P. Luszczek","doi":"10.1145/2503210.2503292","DOIUrl":null,"url":null,"abstract":"The enormous gap between the high-performance capabilities of today's CPUs and off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance. In this article, we describe a successful methodology to address these challenges-starting with our algorithm design, through kernel optimization and tuning, and finishing with our programming model. All these lead to development of a scalable high-performance Singular Value Decomposition (SVD) solver. We developed a set of highly optimized kernels and combined them with advanced optimization techniques that feature fine-grain and cache-contained kernels, a task based approach, and hybrid execution and scheduling runtime, all of which significantly increase the performance of our SVD solver. Our results demonstrate a many-fold performance increase compared to currently available software. In particular, our software is two times faster than Intel's Math Kernel Library (MKL), a highly optimized implementation from the hardware vendor, when all the singular vectors are requested; it achieves a 5-fold speed-up when only 20% of the vectors are computed; and it is up to 10 times faster if only the singular values are required.","PeriodicalId":371074,"journal":{"name":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"42","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2503210.2503292","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 42

Abstract

The enormous gap between the high-performance capabilities of today's CPUs and off-chip communication poses extreme challenges to the development of numerical software that is scalable and achieves high performance. In this article, we describe a successful methodology to address these challenges-starting with our algorithm design, through kernel optimization and tuning, and finishing with our programming model. All these lead to development of a scalable high-performance Singular Value Decomposition (SVD) solver. We developed a set of highly optimized kernels and combined them with advanced optimization techniques that feature fine-grain and cache-contained kernels, a task based approach, and hybrid execution and scheduling runtime, all of which significantly increase the performance of our SVD solver. Our results demonstrate a many-fold performance increase compared to currently available software. In particular, our software is two times faster than Intel's Math Kernel Library (MKL), a highly optimized implementation from the hardware vendor, when all the singular vectors are requested; it achieves a 5-fold speed-up when only 20% of the vectors are computed; and it is up to 10 times faster if only the singular values are required.

查看原文本刊更多论文

一种改进的并行奇异值算法及其多核硬件实现

当今cpu的高性能能力与片外通信之间的巨大差距对可扩展和实现高性能的数值软件的开发提出了极大的挑战。在本文中，我们描述了一种解决这些挑战的成功方法——从我们的算法设计开始，通过内核优化和调优，最后是我们的编程模型。所有这些都导致了可伸缩的高性能奇异值分解(SVD)求解器的发展。我们开发了一组高度优化的内核，并将它们与高级优化技术相结合，这些优化技术具有细粒度和包含缓存的内核、基于任务的方法以及混合执行和调度运行时的特点，所有这些都显著提高了我们的SVD求解器的性能。我们的结果表明，与目前可用的软件相比，它的性能提高了许多倍。特别是，当请求所有奇异向量时，我们的软件比硬件供应商高度优化的英特尔数学内核库(MKL)快两倍;当只计算20%的向量时，它实现了5倍的加速;如果只需要单个值，速度可以提高10倍。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文求助全文

来源期刊

2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC)

自引率

0.00%

发文量