FLOPs作为密集线性代数算法的判别

Proceedings of the 51st International Conference on Parallel Processing Pub Date : 2022-07-05 DOI:10.1145/3545008.3545072

F. L'opez, L. Karlsson, P. Bientinesi

{"title":"FLOPs作为密集线性代数算法的判别","authors":"F. L'opez, L. Karlsson, P. Bientinesi","doi":"10.1145/3545008.3545072","DOIUrl":null,"url":null,"abstract":"Expressions that involve matrices and vectors, known as linear algebra expressions, are commonly evaluated through a sequence of invocations to highly optimised kernels provided in libraries such as BLAS and LAPACK. A sequence of kernels represents an algorithm, and in general, because of associativity, algebraic identities, and multiple kernels, one expression can be evaluated via many different algorithms. These algorithms are all mathematically equivalent (i.e., in exact arithmetic, they all compute the same result), but often differ noticeably in terms of execution time. When faced with a decision, high-level languages, libraries, and tools such as Julia, Armadillo, and Linnea choose by selecting the algorithm that minimises the FLOP count. In this paper, we test the validity of the FLOP count as a discriminant for dense linear algebra algorithms, analysing ”anomalies”: problem instances for which the fastest algorithm does not perform the least number of FLOPs. To do so, we focused on relatively simple expressions and analysed when and why anomalies occurred. We found that anomalies exist and tend to cluster into large contiguous regions. For one expression anomalies were rare, whereas for the other they were abundant. We conclude that FLOPs is not a sufficiently dependable discriminant even when building algorithms with highly optimised kernels. Plus, most of the anomalies remained as such even after filtering out the inter-kernel cache effects. We conjecture that combining FLOP counts with kernel performance models will significantly improve our ability to choose optimal algorithms.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"100 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"FLOPs as a Discriminant for Dense Linear Algebra Algorithms\",\"authors\":\"F. L'opez, L. Karlsson, P. Bientinesi\",\"doi\":\"10.1145/3545008.3545072\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Expressions that involve matrices and vectors, known as linear algebra expressions, are commonly evaluated through a sequence of invocations to highly optimised kernels provided in libraries such as BLAS and LAPACK. A sequence of kernels represents an algorithm, and in general, because of associativity, algebraic identities, and multiple kernels, one expression can be evaluated via many different algorithms. These algorithms are all mathematically equivalent (i.e., in exact arithmetic, they all compute the same result), but often differ noticeably in terms of execution time. When faced with a decision, high-level languages, libraries, and tools such as Julia, Armadillo, and Linnea choose by selecting the algorithm that minimises the FLOP count. In this paper, we test the validity of the FLOP count as a discriminant for dense linear algebra algorithms, analysing ”anomalies”: problem instances for which the fastest algorithm does not perform the least number of FLOPs. To do so, we focused on relatively simple expressions and analysed when and why anomalies occurred. We found that anomalies exist and tend to cluster into large contiguous regions. For one expression anomalies were rare, whereas for the other they were abundant. We conclude that FLOPs is not a sufficiently dependable discriminant even when building algorithms with highly optimised kernels. Plus, most of the anomalies remained as such even after filtering out the inter-kernel cache effects. We conjecture that combining FLOP counts with kernel performance models will significantly improve our ability to choose optimal algorithms.\",\"PeriodicalId\":360504,\"journal\":{\"name\":\"Proceedings of the 51st International Conference on Parallel Processing\",\"volume\":\"100 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-07-05\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 51st International Conference on Parallel Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3545008.3545072\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 51st International Conference on Parallel Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3545008.3545072","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 1

摘要

涉及矩阵和向量的表达式(称为线性代数表达式)通常通过对BLAS和LAPACK等库中提供的高度优化的内核的一系列调用来求值。一个核序列表示一个算法，一般来说，由于结合律、代数恒等式和多核，一个表达式可以通过许多不同的算法求值。这些算法在数学上都是等价的(即，在精确的算术中，它们都计算相同的结果)，但在执行时间方面通常有显著差异。当面临决策时，高级语言、库和工具(如Julia、Armadillo和Linnea)通过选择最小化FLOP计数的算法来进行选择。在本文中，我们测试了FLOP计数作为密集线性代数算法的判别法的有效性，分析了“异常”:最快的算法不会执行最少数量的FLOPs的问题实例。为此，我们将重点放在相对简单的表达式上，并分析异常发生的时间和原因。我们发现异常存在，并且倾向于聚集成大的连续区域。一种表达异常罕见，而另一种表达异常丰富。我们得出结论，即使在构建具有高度优化内核的算法时，FLOPs也不是一个足够可靠的判别法。此外，即使在过滤掉内核间缓存效果之后，大多数异常仍然存在。我们推测，将FLOP计数与内核性能模型相结合将显著提高我们选择最优算法的能力。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

FLOPs as a Discriminant for Dense Linear Algebra Algorithms

Expressions that involve matrices and vectors, known as linear algebra expressions, are commonly evaluated through a sequence of invocations to highly optimised kernels provided in libraries such as BLAS and LAPACK. A sequence of kernels represents an algorithm, and in general, because of associativity, algebraic identities, and multiple kernels, one expression can be evaluated via many different algorithms. These algorithms are all mathematically equivalent (i.e., in exact arithmetic, they all compute the same result), but often differ noticeably in terms of execution time. When faced with a decision, high-level languages, libraries, and tools such as Julia, Armadillo, and Linnea choose by selecting the algorithm that minimises the FLOP count. In this paper, we test the validity of the FLOP count as a discriminant for dense linear algebra algorithms, analysing ”anomalies”: problem instances for which the fastest algorithm does not perform the least number of FLOPs. To do so, we focused on relatively simple expressions and analysed when and why anomalies occurred. We found that anomalies exist and tend to cluster into large contiguous regions. For one expression anomalies were rare, whereas for the other they were abundant. We conclude that FLOPs is not a sufficiently dependable discriminant even when building algorithms with highly optimised kernels. Plus, most of the anomalies remained as such even after filtering out the inter-kernel cache effects. We conjecture that combining FLOP counts with kernel performance models will significantly improve our ability to choose optimal algorithms.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 51st International Conference on Parallel Processing

自引率

0.00%

发文量