Tuning Strassen's Matrix Multiplication for Memory Efficiency

Mithuna Thottethodi, S. Chatterjee, A. Lebeck
Proceedings of the IEEE/ACM SC98 Conference, 1998-11-07
DOI: 10.1109/SC.1998.10045 (https://doi.org/10.1109/SC.1998.10045)
Citations: 89

Abstract

Strassen's algorithm for matrix multiplication gains its lower arithmetic complexity at the expense of reduced locality of reference, which makes it challenging to implement efficiently on a modern machine with a hierarchical memory system. We report on an implementation of this algorithm that uses several unconventional techniques to make it memory-friendly. First, the algorithm internally uses a nonstandard array layout known as Morton order, which is based on a quad-tree decomposition of the matrix. Second, we dynamically select the recursion truncation point to minimize padding without affecting the performance of the algorithm, which is possible by virtue of the cache behavior of the Morton ordering. Each technique is critical for performance, and their combination, as done in our code, multiplies their effectiveness. Performance comparisons with competing implementations show that our implementation often outperforms the alternatives (by up to 25%). However, we also observe wide variability across platforms and across matrix sizes, indicating that, at this time, no single implementation is a clear choice for all platforms or matrix sizes. We also note that converting matrices to and from Morton order takes a noticeable fraction of execution time (5% to 15%). Eliminating this overhead further reduces our execution time.
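
As a rough illustration of the two techniques named in the abstract, the following Python sketch shows (a) a bit-interleaving function that maps quadrant (tile) coordinates to a Morton-order index, and (b) a search for a truncation tile size that minimizes the padded matrix dimension. The function names, the tile-size range of 16 to 64, and the rule that the padded dimension equals tile * 2**depth are assumptions made for illustration; the abstract does not specify them, and this is not the authors' code.

```python
from typing import Tuple


def morton_index(row: int, col: int) -> int:
    """Interleave the bits of (row, col) into a Morton (Z-order) index.

    Column bits go to even positions and row bits to odd positions, so the
    four quadrants of each 2x2 block are visited as NW, NE, SW, SE.
    (The bit assignment is an illustrative convention, not taken from the paper.)
    """
    index = 0
    bit = 0
    while (row >> bit) or (col >> bit):
        index |= ((col >> bit) & 1) << (2 * bit)
        index |= ((row >> bit) & 1) << (2 * bit + 1)
        bit += 1
    return index


def choose_truncation(n: int, t_min: int = 16, t_max: int = 64) -> Tuple[int, int, int]:
    """Pick a tile size and recursion depth that minimize padding.

    Assumes the recursion halves the matrix `depth` times and stops at
    tiles of size `tile`, so the padded dimension is tile * 2**depth.
    The default range [16, 64] is a hypothetical choice.
    Returns (tile, depth, padded); ties go to the smallest tile size.
    """
    best = None
    for tile in range(t_min, t_max + 1):
        depth = 0
        while (tile << depth) < n:
            depth += 1
        padded = tile << depth
        if best is None or padded < best[2]:
            best = (tile, depth, padded)
    return best
```

Under these assumptions, a 513 x 513 matrix is padded only to 528 (tile size 33, recursion depth 4), whereas a fixed tile size of 32 would force padding to 1024.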