Unexpected Diversity: Quantitative Memory Analysis for Zynq UltraScale+ Systems
Authors: Kristiyan Manev, Anuj Vaishnav, Dirk Koch
Published in: 2019 International Conference on Field-Programmable Technology (ICFPT)
Publication date: 2019-10-07
DOI: https://doi.org/10.1109/ICFPT47387.2019.00029
Citations: 17
Abstract
Memory throughput is one of the major bottlenecks for accelerator performance. Now that Zynq UltraScale+ systems are being deployed everywhere from exascale installations to the edge, it is important to understand the characteristics of their memory subsystem and the optimizations available to developers. In this paper, we extensively evaluate memory performance and behaviour for various AXI port combinations, burst sizes, access patterns, and numbers of accelerators per AXI port. Our results on the ZCU102 and Ultra96 boards show that 1) the effective throughput of these systems reaches only 75% and 92.5% of the theoretical maximum, respectively, 2) burst sizes of 128 and 192 bytes are often optimal, 3) AXI ports of the same type may not always exhibit similar behaviour, 4) multiplexing accelerators in the PL can provide a better throughput distribution than multiplexing in the PS, and 5) using all AXI ports does not lead to the highest performance.
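To make the "percentage of theoretical maximum" figures concrete, the following is a minimal sketch of the arithmetic only. It assumes the usual definition of peak DRAM bandwidth (transfer rate times bus width); the function names and the 2400 MT/s, 64-bit configuration are hypothetical illustrations and are not taken from the paper or from either board. Only the 75% and 92.5% efficiency values come from the abstract.

```python
def peak_bandwidth_gbps(transfer_rate_mts: float, bus_width_bits: int) -> float:
    """Theoretical peak DRAM bandwidth in GB/s (1 GB = 1e9 bytes).

    transfer_rate_mts: data rate in mega-transfers per second.
    bus_width_bits:    width of the data bus in bits.
    """
    bytes_per_transfer = bus_width_bits / 8
    return transfer_rate_mts * 1e6 * bytes_per_transfer / 1e9


def effective_bandwidth_gbps(peak_gbps: float, efficiency: float) -> float:
    """Scale a theoretical peak by a measured efficiency in [0, 1]."""
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be a fraction between 0 and 1")
    return peak_gbps * efficiency


# Hypothetical memory configuration, chosen only to illustrate the calculation.
example_peak = peak_bandwidth_gbps(transfer_rate_mts=2400, bus_width_bits=64)

# Efficiency fractions reported in the abstract (75% and 92.5%).
for efficiency in (0.75, 0.925):
    achieved = effective_bandwidth_gbps(example_peak, efficiency)
    print(f"{efficiency:.1%} of {example_peak:.1f} GB/s = {achieved:.2f} GB/s")
```

Substituting a board's actual memory data rate and bus width for the hypothetical values gives the ceiling against which a measured throughput can be compared.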