Performance analysis of SVD algorithm on the Trident processor

M. Soliman, S. Sedukhin
{"title":"Performance analysis of SVD algorithm on the Trident processor","authors":"M. Soliman, S. Sedukhin","doi":"10.1109/CW.2002.1180865","DOIUrl":null,"url":null,"abstract":"Within the current decade, process technology is promising one billion transistors on a single die, operating at frequency of from 6 to 10 GHz. As a direct result of the fundamental trends of increasing transistors density and switching speeds, newer technological and microarchitectural design constrains are introduced. Among them, wire delays will become critical. To take the benefits of the VLSI technology, we proposed Trident processor, which emphasizes on local communication. Like vector architectures, Trident processor extends a scalar core with parallel lanes; each lane contains an execution datapath and a slice of register file. However, Trident processor uses ring and communication registers, which are based on local communication, to store and cyclically shift 1-D data within and across the lanes, respectively. By using parallel datapaths, ring, and communication registers, Trident processor can effectively process not only vector but also matrix data. In this paper, the performance of the Trident processor on singular value decomposition (SVD) algorithm is evaluated. On 500/spl times/600 input matrix, four lanes Trident processor significantly reduces the number of instructions (44 times less), loop overhead (30 times less), and load/store operations (3 times less) comparing with a scalar code. Moreover, Trident processor is scalable and its scalability needs only replicating lanes to process longer vectors or larger matrices (eight lanes can speedup SVD by 2.5 times over four lanes).","PeriodicalId":376322,"journal":{"name":"First International Symposium on Cyber Worlds, 2002. Proceedings.","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2002-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"First International Symposium on Cyber Worlds, 2002. Proceedings.","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CW.2002.1180865","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Within the current decade, process technology is promising one billion transistors on a single die, operating at frequency of from 6 to 10 GHz. As a direct result of the fundamental trends of increasing transistors density and switching speeds, newer technological and microarchitectural design constrains are introduced. Among them, wire delays will become critical. To take the benefits of the VLSI technology, we proposed Trident processor, which emphasizes on local communication. Like vector architectures, Trident processor extends a scalar core with parallel lanes; each lane contains an execution datapath and a slice of register file. However, Trident processor uses ring and communication registers, which are based on local communication, to store and cyclically shift 1-D data within and across the lanes, respectively. By using parallel datapaths, ring, and communication registers, Trident processor can effectively process not only vector but also matrix data. In this paper, the performance of the Trident processor on singular value decomposition (SVD) algorithm is evaluated. On 500/spl times/600 input matrix, four lanes Trident processor significantly reduces the number of instructions (44 times less), loop overhead (30 times less), and load/store operations (3 times less) comparing with a scalar code. Moreover, Trident processor is scalable and its scalability needs only replicating lanes to process longer vectors or larger matrices (eight lanes can speedup SVD by 2.5 times over four lanes).
SVD算法在Trident处理器上的性能分析
在目前的十年里,工艺技术有望在单个芯片上实现10亿个晶体管,工作频率从6到10千兆赫。增加晶体管密度和开关速度的基本趋势的直接结果是引入了新的技术和微结构设计限制。其中,线路延迟将变得至关重要。为了充分利用VLSI技术的优势,我们提出了强调本地通信的Trident处理器。与矢量架构一样,Trident处理器扩展了具有并行通道的标量内核;每个通道包含一个执行数据路径和一个寄存器文件片。然而,Trident处理器使用基于本地通信的环寄存器和通信寄存器,分别在车道内和车道间存储和循环移动1-D数据。通过使用并行数据路径、环寄存器和通信寄存器,Trident处理器不仅可以有效地处理矢量数据,还可以有效地处理矩阵数据。本文对Trident处理器在奇异值分解(SVD)算法中的性能进行了评价。在500/spl次/600输入矩阵上,与标量代码相比,四通道Trident处理器显着减少了指令数量(减少44倍),循环开销(减少30倍)和加载/存储操作(减少3倍)。此外,Trident处理器是可扩展的,它的可扩展性只需要复制通道来处理更长的向量或更大的矩阵(8通道可以将SVD加速到4通道的2.5倍)。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 求助全文
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信