{"title":"Performance analysis of SVD algorithm on the Trident processor","authors":"M. Soliman, S. Sedukhin","doi":"10.1109/CW.2002.1180865","journal":{"name":"First International Symposium on Cyber Worlds, 2002. Proceedings."},"publicationDate":"2002-11-06","platform":"Semanticscholar","ListUrlMain":"https://doi.org/10.1109/CW.2002.1180865"}
Performance analysis of SVD algorithm on the Trident processor
Within the current decade, process technology promises one billion transistors on a single die, operating at frequencies from 6 to 10 GHz. As a direct result of the fundamental trends of increasing transistor density and switching speed, new technological and microarchitectural design constraints are introduced. Among them, wire delays will become critical. To exploit the benefits of VLSI technology, we proposed the Trident processor, which emphasizes local communication. Like vector architectures, the Trident processor extends a scalar core with parallel lanes; each lane contains an execution datapath and a slice of the register file. However, the Trident processor uses ring and communication registers, which are based on local communication, to store and cyclically shift 1-D data within and across the lanes, respectively. By using parallel datapaths, ring registers, and communication registers, the Trident processor can effectively process not only vector but also matrix data. In this paper, the performance of the Trident processor on the singular value decomposition (SVD) algorithm is evaluated. On a 500 × 600 input matrix, a four-lane Trident processor significantly reduces the number of instructions (44 times fewer), the loop overhead (30 times less), and the number of load/store operations (3 times fewer) compared with scalar code. Moreover, the Trident processor is scalable, and scaling it requires only replicating lanes to process longer vectors or larger matrices (eight lanes can speed up SVD by 2.5 times over four lanes).
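The abstract does not say which SVD variant was evaluated. As an illustration only, the sketch below assumes a one-sided Jacobi (Hestenes) formulation, chosen because every step is a dot product or a linear update of two columns, which is the kind of 1-D work that parallel lanes are described as handling. The function name `jacobi_svd_singular_values` and its parameters `sweeps` and `tol` are hypothetical, and this is plain scalar Python, not Trident code.

```python
import math


def jacobi_svd_singular_values(a, sweeps=30, tol=1e-12):
    """Singular values of a (list of rows) by one-sided Jacobi rotations.

    Assumption: this is a generic textbook formulation, not the
    algorithm or code used in the paper.
    """
    m, n = len(a), len(a[0])
    # Store the matrix by columns: each step touches whole columns.
    cols = [[a[i][j] for i in range(m)] for j in range(n)]

    for _ in range(sweeps):
        rotated = False
        for p in range(n - 1):
            for q in range(p + 1, n):
                # Three dot products over the two columns.
                alpha = sum(x * x for x in cols[p])
                beta = sum(y * y for y in cols[q])
                gamma = sum(x * y for x, y in zip(cols[p], cols[q]))

                # Skip pairs that are already orthogonal enough.
                if abs(gamma) <= tol * math.sqrt(alpha * beta):
                    continue
                rotated = True

                # Rotation angle that makes the two columns orthogonal.
                zeta = (beta - alpha) / (2.0 * gamma)
                t = math.copysign(1.0, zeta) / (
                    abs(zeta) + math.sqrt(1.0 + zeta * zeta)
                )
                c = 1.0 / math.sqrt(1.0 + t * t)
                s = c * t

                # Element-wise update of both columns.
                cp, cq = cols[p], cols[q]
                cols[p] = [c * x - s * y for x, y in zip(cp, cq)]
                cols[q] = [s * x + c * y for x, y in zip(cp, cq)]
        if not rotated:
            break

    # After convergence the column norms are the singular values.
    norms = (math.sqrt(sum(x * x for x in col)) for col in cols)
    return sorted(norms, reverse=True)


if __name__ == "__main__":
    print(jacobi_svd_singular_values([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]))
```

In this formulation the inner loop body consists only of reductions and element-wise updates over column vectors, so the work per column pair grows with the number of rows while the control flow stays fixed.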