MV-FT: Efficient Implementation for Matrix-Vector Multiplication on FT64 Stream Processor
Jing Du, F. Ao, Xuejun Yang
Second International Conference on the Digital Society, 2008-02-10
DOI: 10.1109/ICDS.2008.16 (https://doi.org/10.1109/ICDS.2008.16)
Citations: 3
Abstract
In this paper, we present a detailed case study of an optimized implementation of a fundamental scientific kernel, matrix-vector multiplication, on FT64, the first 64-bit stream processor designed for scientific computing. The main contributions of our study are as follows. First, we develop four stream programs based on different stream organizations: the dot product, row product, multi-dot product and multi-row product approaches. Second, we derive the optimal strip size for partitioning a large matrix from a practical parameter model. Finally, we use loop unrolling and software pipelining to overlap communication with computation. The experimental results show that the optimized implementations on FT64 achieve high speedup over the corresponding Fortran programs running on Itanium 2. These results indicate that, with suitable programming optimizations, matrix-vector multiplication can efficiently exploit the potential of the FT64 stream processor.
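The abstract names its techniques without showing them, so the following is a minimal Python sketch of the general ideas, not the paper's FT64 stream code. It shows two standard loop orderings for y = A x (a per-row reduction and a per-column accumulation), row-strip partitioning, and a dot product unrolled by four. All function names and the `strip_rows` parameter are hypothetical, and the mapping of these orderings onto the paper's "dot product" and "row product" programs is an assumption.

```python
def matvec_dot(A, x):
    """Reduction form: y[i] is the dot product of row i of A with x."""
    return [sum(a * b for a, b in zip(row, x)) for row in A]


def matvec_axpy(A, x):
    """Accumulation form: add x[j] times column j of A into y, one column at a time."""
    y = [0.0] * len(A)
    for j, xj in enumerate(x):
        for i in range(len(A)):
            y[i] += A[i][j] * xj
    return y


def matvec_strips(A, x, strip_rows):
    """Process A in strips of `strip_rows` rows (hypothetical parameter).

    On a stream processor the strip size would be chosen so that one strip
    fits in on-chip storage; here it only controls the chunking.
    """
    if strip_rows <= 0:
        raise ValueError("strip_rows must be positive")
    y = []
    for start in range(0, len(A), strip_rows):
        y.extend(matvec_dot(A[start:start + strip_rows], x))
    return y


def dot_unrolled4(row, x):
    """Dot product unrolled by 4 with independent partial sums."""
    n = len(row)
    limit = n - n % 4
    s0 = s1 = s2 = s3 = 0.0
    for k in range(0, limit, 4):
        s0 += row[k] * x[k]
        s1 += row[k + 1] * x[k + 1]
        s2 += row[k + 2] * x[k + 2]
        s3 += row[k + 3] * x[k + 3]
    # Handle the leftover elements when n is not a multiple of 4.
    tail = sum(row[k] * x[k] for k in range(limit, n))
    return (s0 + s1) + (s2 + s3) + tail
```

All three matrix-vector functions compute the same result and differ only in the order in which data is touched, which is the kind of choice a stream organization fixes. The independent partial sums in `dot_unrolled4` illustrate why unrolling helps: the four accumulations do not depend on one another, so hardware with several arithmetic units could issue them in parallel.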