Authors: Chong Yeam Tan, C. Y. Ooi, N. Ismail
Published in: 2019 32nd IEEE International System-on-Chip Conference (SOCC), 2019-09-01
DOI: 10.1109/SOCC46988.2019.1570548480 (https://doi.org/10.1109/SOCC46988.2019.1570548480)
Citations: 2
Loop Optimizations of MGS-QRD Algorithm for FPGA High-Level Synthesis
Abstract
The best-known Modified Gram-Schmidt QR decomposition (MGS-QRD) algorithm contains many data, memory, loop, and control dependencies that hinder high-level synthesis (HLS) from optimizing it. We therefore present a well-formed algorithm structure that reduces latency and hardware resources. We also present a second MGS-QRD algorithm that further reduces DSP usage and supports larger QR decomposition sizes. The proposed algorithms achieve better overall performance than the best-known MGS-QRD algorithm. Mapped to an Intel Arria 10 FPGA device, the first proposed algorithm achieves an implemented system latency of 0.53 µs for an 8x8 real QRD, and the second achieves 0.59 µs. Various HLS optimization steps and dependence analysis are also provided to improve performance, yielding an approximately 44-fold increase in QRD throughput.
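For orientation, the dependencies mentioned in the abstract can be seen in a plain textbook formulation of MGS-QRD. The sketch below is not either of the proposed algorithms and is not HLS code; it is a minimal Python reference, assuming a real input matrix with full column rank, and the function name `mgs_qrd` is chosen here purely for illustration.

```python
import math


def mgs_qrd(a):
    """Textbook Modified Gram-Schmidt QR decomposition (illustrative only).

    `a` is a list of rows (m x n, m >= n, full column rank assumed).
    Returns (q, r) such that a = q * r, where q has orthonormal columns
    and r is upper triangular.
    """
    m, n = len(a), len(a[0])
    q = [row[:] for row in a]            # working copy, overwritten column by column
    r = [[0.0] * n for _ in range(n)]

    for k in range(n):
        # Norm of column k: an accumulation (reduction) over the rows.
        r[k][k] = math.sqrt(sum(q[i][k] ** 2 for i in range(m)))

        # Normalise column k; this cannot start before the norm is known.
        for i in range(m):
            q[i][k] /= r[k][k]

        # Remove the q_k component from every later column. The next outer
        # iteration (k + 1) reads the columns written here, so the outer
        # loop carries a dependence from one iteration to the next.
        for j in range(k + 1, n):
            r[k][j] = sum(q[i][k] * q[i][j] for i in range(m))
            for i in range(m):
                q[i][j] -= r[k][j] * q[i][k]

    return q, r
```

In this form the square root, the division, and the column update inside each outer iteration must run in sequence, and each outer iteration depends on the previous one. That chain is the kind of loop and data dependence an HLS tool has to analyse before it can pipeline or unroll the loops.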