Philipp Holzinger, Daniel Reiser, Tobias Hahn, M. Reichenbach
Fast HBM Access with FPGAs: Analysis, Architectures, and Applications
2021 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW), published 2021-06-01
DOI: 10.1109/IPDPSW52791.2021.00030 (https://doi.org/10.1109/IPDPSW52791.2021.00030)
Citations: 6
Abstract
Over the past few decades, the gap between rapidly increasing computational power and almost stagnating memory bandwidth has steadily worsened. Recently, 3D die-stacking in the form of High Bandwidth Memory (HBM) enabled the first major jump in external memory throughput in years. In contrast to traditional DRAM, it compensates for its lower clock frequency with wide buses and a large number of separate channels. However, this also requires data to be spread out over all channels to reach the full throughput. Previous research relied on manual HBM data partitioning schemes and handled each channel as an entirely independent entity. In contrast, this paper also considers scalable hardware adaptations and approaches system design holistically. We first analyze the problem with real-world measurements on a Xilinx HBM FPGA. Then we derive several architectural changes to improve throughput and ease accelerator design. Finally, we present a Roofline-based model that estimates the expected performance more accurately in advance. With these measures we were able to increase the throughput by up to 3.78× with random and 40.6× with certain strided access patterns compared to Xilinx's state-of-the-art switch fabric.
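The abstract's point that data must be spread over all channels can be illustrated with a small sketch. It is not the paper's architecture or its performance model. It assumes a simple block-interleaved address mapping and uses the classic Roofline bound with a bandwidth ceiling limited by the number of channels an access stream actually reaches. All constants (`NUM_CHANNELS`, `INTERLEAVE_BYTES`, `CHANNEL_BW_GBS`) and all function names are hypothetical values chosen for illustration.

```python
# Illustrative sketch only: the mapping, constants, and names are assumptions,
# not taken from the paper or from any vendor documentation.

NUM_CHANNELS = 32        # assumption: number of independent HBM channels
INTERLEAVE_BYTES = 256   # assumption: bytes mapped to one channel before the next
CHANNEL_BW_GBS = 14.0    # assumption: illustrative per-channel peak bandwidth (GB/s)


def channel_of(addr: int) -> int:
    """Map a byte address to a channel under simple block interleaving."""
    return (addr // INTERLEAVE_BYTES) % NUM_CHANNELS


def channels_touched(stride_bytes: int, num_accesses: int = 4096) -> int:
    """Count the distinct channels hit by a strided stream starting at address 0."""
    return len({channel_of(i * stride_bytes) for i in range(num_accesses)})


def bandwidth_ceiling(stride_bytes: int) -> float:
    """Upper bound on bandwidth (GB/s) if only the touched channels can contribute."""
    return channels_touched(stride_bytes) * CHANNEL_BW_GBS


def roofline(peak_gops: float, bandwidth_gbs: float, ops_per_byte: float) -> float:
    """Classic Roofline bound: attainable performance in Gop/s."""
    return min(peak_gops, bandwidth_gbs * ops_per_byte)


if __name__ == "__main__":
    for stride in (64, 256, 512, 8192):
        bw = bandwidth_ceiling(stride)
        perf = roofline(peak_gops=1000.0, bandwidth_gbs=bw, ops_per_byte=0.5)
        print(f"stride={stride:5d} B  channels={channels_touched(stride):2d}  "
              f"bw<={bw:6.1f} GB/s  perf<={perf:6.1f} Gop/s")
```

Under these assumptions, a stride equal to `INTERLEAVE_BYTES * NUM_CHANNELS` sends every access to the same channel, so the bandwidth ceiling collapses to that of a single channel. This is the kind of pathological strided pattern where changing how addresses are distributed can yield large gains.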