Advancing Matrix Decomposition Efficiency: A Study on FT-Matrix DSP Based SVD Optimization

IF 3.7 · Zone 3 (Computer Science) · Q2 COMPUTER SCIENCE, INFORMATION SYSTEMS

Journal of Cloud Computing-Advances Systems and Applications · Pub Date: 2023-07-01 · DOI: 10.1109/CSCloud-EdgeCom58631.2023.00085

Anxing Xie, Yonghua Hu, Aobo Cheng, Zhuoyou Tang, P. Liu, Xin Zhang

Volume: 97 1 · Pages: 464-469

Citations: 0

Abstract

Matrix decomposition is a fundamental operation in linear algebra, with applications in machine learning, signal processing, edge computing, and many other fields. Singular Value Decomposition (SVD) factors a matrix into three matrices: two orthogonal matrices and a diagonal matrix. With the development of domestic high-performance Digital Signal Processors (DSPs), the demand for matrix computation on DSP platforms is increasing, which makes DSP-based SVD implementations an important research topic. However, achieving high performance requires developers who are familiar with the hardware characteristics, so that the features of the algorithm can be matched to the limited hardware resources. To reduce the cost of computing the SVD of a matrix, we implement a vectorization mapping method for the SVD algorithm on the FT-M7002. The single instruction, multiple data (SIMD) instructions of the FT-M7002 processor are used to exploit the data-level parallelism in the SVD algorithm. Instead of using data movement and a scalar processing unit (SPU), we compute with a single vector processing element (VPE). Additionally, a DMA transfer algorithm is designed to implement matrix transposition and resolve the issue of discontinuous data access. Experimental results show that the optimized SVD algorithm improves execution performance by up to 5.0× relative to the original SVD algorithm on FT. Furthermore, we demonstrate that the optimized SVD algorithm on the FT-M7002 runs 1.0-2.0× faster than the optimized SVD algorithm on the TMS320C6678 processor.
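The abstract does not say which SVD algorithm is vectorized. As an illustration only, the sketch below uses the one-sided (Hestenes) Jacobi method, which is an assumption here and not the authors' confirmed choice. It is shown because its inner work consists of dot products and plane rotations over whole columns, the kind of element-wise arithmetic that SIMD lanes can process in parallel. The function name `jacobi_svd` and its parameters are hypothetical, and plain Python lists stand in for vector registers.

```python
import math


def jacobi_svd(a, max_sweeps=30, tol=1e-12):
    """One-sided Jacobi SVD sketch for an m x n matrix (list of rows, m >= n).

    Returns (u, sigma, v) with a ~= u * diag(sigma) * v^T.
    Singular values are NOT sorted. Illustrative, not tuned for any DSP.
    """
    m, n = len(a), len(a[0])
    # Store the matrix column by column so each rotation reads two
    # contiguous vectors instead of striding through rows.
    cols = [[float(a[i][j]) for i in range(m)] for j in range(n)]
    v_cols = [[1.0 if i == j else 0.0 for i in range(n)] for j in range(n)]

    for _ in range(max_sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Three dot products: pure multiply-accumulate over columns.
                alpha = sum(x * x for x in cols[p])
                beta = sum(y * y for y in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue  # columns p and q are already orthogonal
                rotated = True
                # Rotation angle that makes the two columns orthogonal.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (abs(zeta) + math.sqrt(1.0 + zeta * zeta))
                cs = 1.0 / math.sqrt(1.0 + t * t)
                sn = cs * t
                # Apply the same rotation to the data columns and to V.
                for vecs in (cols, v_cols):
                    xp, xq = vecs[p], vecs[q]
                    vecs[p] = [cs * x - sn * y for x, y in zip(xp, xq)]
                    vecs[q] = [sn * x + cs * y for x, y in zip(xp, xq)]
        if not rotated:
            break  # a full sweep without rotations means convergence

    # Column norms are the singular values; normalized columns form U.
    sigma = [math.sqrt(sum(x * x for x in col)) for col in cols]
    u_cols = [[x / sg if sg > 0.0 else 0.0 for x in col] for col, sg in zip(cols, sigma)]
    # Convert back from column storage to row-major matrices.
    u = [[u_cols[j][i] for j in range(n)] for i in range(m)]
    v = [[v_cols[j][i] for j in range(n)] for i in range(n)]
    return u, sigma, v
```

Calling `jacobi_svd([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])` returns factors whose product reproduces the input up to rounding. The conversions between row-major and column storage at the start and end are transpositions; on hardware with strided memory access, that is the kind of step the abstract assigns to a DMA transfer.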




Source journal

Journal of Cloud Computing-Advances Systems and Applications (Computer Science: Computer Networks and Communications)

CiteScore: 6.80

Self-citation rate: 7.50%

Review time: 75 days

About the journal: The Journal of Cloud Computing: Advances, Systems and Applications (JoCCASA) will publish research articles on all aspects of Cloud Computing. Principally, articles will address topics that are core to Cloud Computing, focusing on the Cloud applications, the Cloud systems, and the advances that will lead to the Clouds of the future. Comprehensive review and survey articles that offer new insights and lay the foundations for further exploratory and experimental work are also relevant.