Tuning Strassen's Matrix Multiplication for Memory Efficiency

Proceedings of the IEEE/ACM SC98 Conference. Pub Date: 1998-11-07. DOI: 10.1109/SC.1998.10045

Mithuna Thottethodi, S. Chatterjee, A. Lebeck

Citations: 89

Abstract

Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make it memory-friendly. First, the algorithm internally uses a non-standard array layout known as Morton order, which is based on a quad-tree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which is possible by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination, as done in our code, multiplies their effectiveness. Performance comparisons with competing implementations show that our implementation often outperforms the alternative techniques (by up to 25%). However, we also observe wide variability across platforms and across matrix sizes, indicating that, at this time, no single implementation is a clear choice for all platforms or matrix sizes. We also note that converting matrices to and from Morton order accounts for a noticeable fraction of execution time (5% to 15%). Eliminating this overhead further reduces our execution time.
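The two techniques named in the abstract can be illustrated with a minimal sketch. It is not the authors' code. The function names `morton_index` and `choose_tile`, the element-level bit interleaving, and the candidate tile range of 16 to 64 are all assumptions made for illustration. The first function maps a (row, column) pair to its position in a quad-tree (Z-order) layout. The second picks a leaf tile size and recursion depth so that the padded dimension, tile * 2**depth, exceeds the matrix dimension n by as little as possible.

```python
def morton_index(row, col):
    """Position of element (row, col) in a Z-order (Morton) layout.

    Assumption: bits are interleaved down to single elements, with column
    bits in even positions and row bits in odd positions.
    """
    index = 0
    bit = 0
    while (row >> bit) or (col >> bit):
        index |= ((col >> bit) & 1) << (2 * bit)
        index |= ((row >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return index


def choose_tile(n, t_min=16, t_max=64):
    """Return (tile, depth, padded) with the smallest padded >= n.

    Assumption: the recursion stops at a square leaf tile whose side may be
    any integer in [t_min, t_max]; this range is hypothetical.
    """
    best = None
    for tile in range(t_min, t_max + 1):
        depth = 0
        # Double the covered size until it reaches the matrix dimension.
        while (tile << depth) < n:
            depth += 1
        padded = tile << depth
        if best is None or padded < best[2]:
            best = (tile, depth, padded)
    return best


if __name__ == "__main__":
    # The four elements of a 2x2 matrix, visited in row-major order.
    print([morton_index(r, c) for r in range(2) for c in range(2)])
    # A fixed tile of 32 would pad n = 513 up to 1024.
    print(choose_tile(513))
```

With a fixed tile size, a dimension just above a power of two nearly doubles after padding. Letting the tile size vary keeps the padded dimension close to n, which is the effect the abstract attributes to selecting the truncation point dynamically.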
