在共享内存原语中表达布尔立方矩阵算法

S. Johnsson, C. T. Ho
{"title":"在共享内存原语中表达布尔立方矩阵算法","authors":"S. Johnsson, C. T. Ho","doi":"10.1145/63047.63121","DOIUrl":null,"url":null,"abstract":"The multiplication of (large) matrices allocated evenly on Boolean cube configured multiprocessors poses several interesting trade-offs with respect to communication time, processor utilization, and storage requirement. In [7] we investigated several algorithms for different degrees of parallelization, and showed how the choice of algorithm with respect to performance depends on the matrix shape, and the multiprocessor parameters, and how processors should be allocated optimally to the different loops.\nIn this paper the focus is on expressing the algorithms in shared memory type primitives. We assume that all processors share the same global address space, and present communication primitives both for nearest-neighbor communication, and global operations such as broadcasting from one processor to a set of processors, the reverse operation of plus-reduction, and matrix transposition (dimension permutation). We consider both the case where communication is restricted to one processor port at a time, or concurrent communication on all processor ports. The communication algorithms are provably optimal within a factor of two. We describe both constant storage algorithms, and algorithms with reduced communication time, but a storage need proportional to the number of processors and the matrix sizes (for a one-dimensional partitioning of the matrices).","PeriodicalId":299435,"journal":{"name":"Conference on Hypercube Concurrent Computers and Applications","volume":"49 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1989-01-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Expressing Boolean cube matrix algorithms in shared memory primitives\",\"authors\":\"S. Johnsson, C. T. Ho\",\"doi\":\"10.1145/63047.63121\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The multiplication of (large) matrices allocated evenly on Boolean cube configured multiprocessors poses several interesting trade-offs with respect to communication time, processor utilization, and storage requirement. In [7] we investigated several algorithms for different degrees of parallelization, and showed how the choice of algorithm with respect to performance depends on the matrix shape, and the multiprocessor parameters, and how processors should be allocated optimally to the different loops.\\nIn this paper the focus is on expressing the algorithms in shared memory type primitives. We assume that all processors share the same global address space, and present communication primitives both for nearest-neighbor communication, and global operations such as broadcasting from one processor to a set of processors, the reverse operation of plus-reduction, and matrix transposition (dimension permutation). We consider both the case where communication is restricted to one processor port at a time, or concurrent communication on all processor ports. The communication algorithms are provably optimal within a factor of two. We describe both constant storage algorithms, and algorithms with reduced communication time, but a storage need proportional to the number of processors and the matrix sizes (for a one-dimensional partitioning of the matrices).\",\"PeriodicalId\":299435,\"journal\":{\"name\":\"Conference on Hypercube Concurrent Computers and Applications\",\"volume\":\"49 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1989-01-03\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Conference on Hypercube Concurrent Computers and Applications\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/63047.63121\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Conference on Hypercube Concurrent Computers and Applications","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/63047.63121","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6

摘要

在布尔立方体配置的多处理器上均匀分配的(大)矩阵的乘法在通信时间、处理器利用率和存储需求方面带来了一些有趣的权衡。在[7]中,我们研究了几种不同程度并行化的算法,并展示了如何选择与性能相关的算法取决于矩阵形状和多处理器参数,以及如何将处理器最佳地分配给不同的循环。本文的重点是在共享内存类型原语中表达算法。我们假设所有处理器共享相同的全局地址空间,并提供用于最近邻通信和全局操作的通信原语,例如从一个处理器广播到一组处理器、加减运算的反向操作和矩阵转置(维置换)。我们考虑了两种情况,一种是通信被限制在一个处理器端口上,另一种是所有处理器端口上的并发通信。可证明,该通信算法在两个因子内是最优的。我们描述了恒定存储算法和减少通信时间的算法,但是存储需求与处理器数量和矩阵大小成正比(对于矩阵的一维划分)。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
Expressing Boolean cube matrix algorithms in shared memory primitives
The multiplication of (large) matrices allocated evenly on Boolean cube configured multiprocessors poses several interesting trade-offs with respect to communication time, processor utilization, and storage requirement. In [7] we investigated several algorithms for different degrees of parallelization, and showed how the choice of algorithm with respect to performance depends on the matrix shape, and the multiprocessor parameters, and how processors should be allocated optimally to the different loops. In this paper the focus is on expressing the algorithms in shared memory type primitives. We assume that all processors share the same global address space, and present communication primitives both for nearest-neighbor communication, and global operations such as broadcasting from one processor to a set of processors, the reverse operation of plus-reduction, and matrix transposition (dimension permutation). We consider both the case where communication is restricted to one processor port at a time, or concurrent communication on all processor ports. The communication algorithms are provably optimal within a factor of two. We describe both constant storage algorithms, and algorithms with reduced communication time, but a storage need proportional to the number of processors and the matrix sizes (for a one-dimensional partitioning of the matrices).
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术官方微信