Implementation of an Area Efficient High Throughput Architecture for Sparse Matrix LU Factorization

G. P. Kumar, Chinthala Ramesh
2019 3rd International Conference on Electronics, Materials Engineering & Nano-Technology (IEMENTech), August 2019. DOI: 10.1109/IEMENTech48150.2019.8981319

Abstract

In many scientific computations, lower-upper (LU) decomposition is an important step, since most scientific applications are modeled with linear equations Ax = b. Linear equations also arise in everyday applications such as business profit prediction, income over time, and mileage-rate calculation. The complexity of the data makes LU decomposition difficult to parallelize, yet parallelizing it speeds up the factorization and reduces delay in critical applications ranging from weather forecasting to power-system load-flow computation. A Field Programmable Gate Array (FPGA) offers abundant logic resources and parallel computation with which to accelerate matrix decomposition. In this work, an area-efficient, high-throughput architecture is designed for sparse-matrix LU factorization by modifying the computing steps of the algorithm. Compared with the modified KLU algorithm, the original KLU algorithm occupies more area and delivers lower throughput; the proposed design reduces area by 10%. The hardware complexity of implementing sparse LU factorization on an FPGA is 15% lower than on a CPU or GPU [4], and the 10% to 12% throughput achieved on CPUs and GPUs does not reach the theoretical computing efficiency (theoretical peak throughput). The hardware efficiency of UMFPACK and SuperLU (typically 1% to 4%) is very low due to poor utilization of the floating-point units.
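To make the factorization step concrete, the following is a minimal sketch of dense LU decomposition (Doolittle form, no pivoting) in plain Python. It is illustrative only: the paper's contribution is a hardware architecture for the sparse KLU algorithm, whose actual computing steps (symbolic analysis, pivoting, block handling) are not reproduced here, and the function name `lu_decompose` is our own.

```python
def lu_decompose(A):
    """Return (L, U) with A = L @ U.

    Doolittle variant: L has a unit diagonal. Assumes no zero pivot
    is encountered (real sparse solvers such as KLU pivot to avoid
    this and to preserve sparsity).
    """
    n = len(A)
    # L starts as the identity; U starts as a working copy of A.
    L = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    U = [row[:] for row in A]
    for k in range(n):
        for i in range(k + 1, n):
            m = U[i][k] / U[k][k]   # elimination multiplier
            L[i][k] = m
            for j in range(k, n):   # eliminate row i below the pivot
                U[i][j] -= m * U[k][j]
    return L, U


# Example: A = [[4, 3], [6, 3]] factors with multiplier 6/4 = 1.5.
L, U = lu_decompose([[4.0, 3.0], [6.0, 3.0]])
```

Once L and U are available, Ax = b is solved by one forward substitution (Ly = b) and one backward substitution (Ux = y), which is why accelerating the factorization dominates the overall solve time.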