{"title":"分布式存储并发计算机上快速可扩展的通用矩阵乘法算法","authors":"J. Choi","doi":"10.1109/IPPS.1997.580916","DOIUrl":null,"url":null,"abstract":"The author presents a fast and scalable matrix multiplication algorithm on distributed memory concurrent computers, whose performance is independent of data distribution on processors, and call it DIMMA (distribution-independent matrix multiplication algorithm). The algorithm is based on two new ideas; it uses a modified pipelined communication scheme to overlap computation and communication effectively, and exploits the LCM block concept to obtain the maximum performance of the sequential BLAS routine in each processor when the block size is too small as well as too large. The algorithm is implemented and compared with SUMMA on the Intel Paragon computer.","PeriodicalId":145892,"journal":{"name":"Proceedings 11th International Parallel Processing Symposium","volume":"78 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1997-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"29","resultStr":"{\"title\":\"A fast scalable universal matrix multiplication algorithm on distributed-memory concurrent computers\",\"authors\":\"J. Choi\",\"doi\":\"10.1109/IPPS.1997.580916\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The author presents a fast and scalable matrix multiplication algorithm on distributed memory concurrent computers, whose performance is independent of data distribution on processors, and call it DIMMA (distribution-independent matrix multiplication algorithm). The algorithm is based on two new ideas; it uses a modified pipelined communication scheme to overlap computation and communication effectively, and exploits the LCM block concept to obtain the maximum performance of the sequential BLAS routine in each processor when the block size is too small as well as too large. The algorithm is implemented and compared with SUMMA on the Intel Paragon computer.\",\"PeriodicalId\":145892,\"journal\":{\"name\":\"Proceedings 11th International Parallel Processing Symposium\",\"volume\":\"78 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1997-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"29\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings 11th International Parallel Processing Symposium\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/IPPS.1997.580916\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings 11th International Parallel Processing Symposium","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPPS.1997.580916","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A fast scalable universal matrix multiplication algorithm on distributed-memory concurrent computers
The author presents a fast and scalable matrix multiplication algorithm for distributed-memory concurrent computers whose performance is independent of how the data are distributed over the processors, and calls it DIMMA (distribution-independent matrix multiplication algorithm). The algorithm is based on two new ideas: it uses a modified pipelined communication scheme to overlap computation and communication effectively, and it exploits the LCM (least common multiple) block concept to obtain the maximum performance of the sequential BLAS routine on each processor whether the block size is very small or very large. The algorithm is implemented and compared with SUMMA on the Intel Paragon computer.
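The abstract does not spell out DIMMA's communication scheme, so the sketch below only illustrates the general pattern it shares with SUMMA: at each step a panel of B is broadcast and every process applies a local rank-nb update, with the next broadcast kept in flight while the current update runs. The 1-D row distribution, the MPI_Ibcast double-buffering, the panel sizes, and the plain-loop local_gemm standing in for the sequential BLAS DGEMM are all assumptions made for illustration; the paper itself uses a 2-D block-cyclic distribution and a modified pipelined broadcast.

```c
/* Minimal sketch of a SUMMA-style loop with communication/computation
 * overlap.  Assumptions (not from the paper): 1-D row distribution of A
 * and C, B distributed by block rows, MPI-3 MPI_Ibcast for pipelining.
 * Build with an MPI C compiler, e.g.:  mpicc summa_sketch.c && mpirun -np 4 ./a.out */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

enum { MLOC = 4, N = 6, NB = 3 };   /* local rows of C, cols of C, panel width */

/* Local rank-NB update C += Apanel * Bpanel.  In the paper this role is
 * played by the sequential BLAS DGEMM; a plain triple loop stands in here. */
static void local_gemm(const double *Ap, int lda, const double *Bp, double *C)
{
    for (int i = 0; i < MLOC; i++)
        for (int k = 0; k < NB; k++)
            for (int j = 0; j < N; j++)
                C[i * N + j] += Ap[i * lda + k] * Bp[k * N + j];
}

int main(int argc, char **argv)
{
    MPI_Init(&argc, &argv);
    int rank, nproc;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &nproc);

    int kglob = nproc * NB;                              /* global inner dim */
    double *A    = malloc(sizeof(double) * MLOC * kglob);/* my rows of A     */
    double *Bown = malloc(sizeof(double) * NB * N);      /* my rows of B     */
    double *C    = calloc(MLOC * N, sizeof(double));
    double *buf[2] = { malloc(sizeof(double) * NB * N),  /* double buffer    */
                       malloc(sizeof(double) * NB * N) };

    for (int i = 0; i < MLOC * kglob; i++) A[i] = 1.0;
    for (int i = 0; i < NB * N; i++)       Bown[i] = (double)(rank + 1);

    /* Pre-post the broadcast of step 0's B panel, then pipeline: while the
     * panel for step t is consumed by local_gemm, the panel for step t+1
     * is already in flight. */
    MPI_Request req[2];
    if (rank == 0) memcpy(buf[0], Bown, sizeof(double) * NB * N);
    MPI_Ibcast(buf[0], NB * N, MPI_DOUBLE, 0, MPI_COMM_WORLD, &req[0]);

    for (int t = 0; t < nproc; t++) {
        int cur = t % 2, nxt = (t + 1) % 2;
        if (t + 1 < nproc) {                 /* start the next transfer early */
            if (rank == t + 1) memcpy(buf[nxt], Bown, sizeof(double) * NB * N);
            MPI_Ibcast(buf[nxt], NB * N, MPI_DOUBLE, t + 1,
                       MPI_COMM_WORLD, &req[nxt]);
        }
        MPI_Wait(&req[cur], MPI_STATUS_IGNORE);
        /* The next broadcast progresses while this update runs. */
        local_gemm(&A[t * NB], kglob, buf[cur], C);
    }

    if (rank == 0)   /* with all-ones A: C[0][0] = NB * (1 + 2 + ... + nproc) */
        printf("C[0][0] = %g\n", C[0][0]);

    free(A); free(Bown); free(C); free(buf[0]); free(buf[1]);
    MPI_Finalize();
    return 0;
}
```

The second idea in the abstract, the LCM block concept, is orthogonal to this loop: under a block-cyclic distribution it groups blocks whose period is the least common multiple of the process-grid dimensions, so each local_gemm call sees a panel wide enough to run the BLAS routine near peak even when the distribution block size itself is very small.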