{"title":"An Out-of-Order Load-Store Queue for Spatial Computing","authors":"Lana Josipović, P. Brisk, P. Ienne","doi":"10.1145/3126525","DOIUrl":"https://doi.org/10.1145/3126525","url":null,"abstract":"The efficiency of spatial computing depends on the ability to achieve maximal parallelism. This needs memory interfaces that can correctly handle memory accesses arriving in arbitrary order while still respecting data dependencies and ensuring appropriate ordering for semantic correctness. However, a typical memory interface for out-of-order processors (i.e., a load-store queue) cannot immediately fulfill these requirements: a different allocation policy is needed to achieve out-of-order execution in a spatial system. We show a practical way to organize the allocation for an out-of-order load-store queue for spatial computing by dynamically allocating groups of memory accesses, where the access order within the group is statically predetermined (for instance by a high-level synthesis tool).","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122583414","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Escher: A CNN Accelerator with Flexible Buffering to Minimize Off-Chip Transfer","authors":"Yongming Shen, M. Ferdman, Peter Milder","doi":"10.1109/FCCM.2017.47","DOIUrl":"https://doi.org/10.1109/FCCM.2017.47","url":null,"abstract":"Convolutional neural networks (CNNs) are used to solve many challenging machine learning problems. Interest in CNNs has led to the design of CNN accelerators to improve CNN evaluation throughput and efficiency. Importantly, the bandwidth demand from weight data transfer for modern large CNNs causes CNN accelerators to be severely bandwidth bottlenecked, prompting the need for processing images in batches to increase weight reuse. However, existing CNN accelerator designs limit the choice of batch sizes and lack support for batch processing of convolutional layers. We observe that, for a given storage budget, choosing the best batch size requires balancing the input and weight transfer. We propose Escher, a CNN accelerator with a flexible data buffering scheme that ensures a balance between the input and weight transfer bandwidth, significantly reducing overall bandwidth requirements. For example, compared to the state-of-the-art CNN accelerator designs targeting a Virtex-7 690T FPGA, Escher reduces the accelerator peak bandwidth requirements by 2.4x across both fully-connected and convolutional layers on fixed-point AlexNet, and reduces convolutional layer bandwidth by up to 10.5x on fixed-point GoogleNet.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121232190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High-Performance Hardware Merge Sorter","authors":"Susumu Mashimo, Thiem Van Chu, Kenji Kise","doi":"10.1109/FCCM.2017.19","DOIUrl":"https://doi.org/10.1109/FCCM.2017.19","url":null,"abstract":"State-of-the-art studies show that FPGA-based hardware merge sorters (HMSs) can achieve superior performance compared with optimized algorithms on CPUs and GPUs. The performance of any HMS is proportional to its operating frequency (F) and the number of records that can be output each cycle (E). However, all existing HMSs have a problem that F drops significantly with increasing E due to the increase of the number of levels of gates. In this paper, we propose novel architectures for HMSs where the number of levels of gates is constant when E is increased. We implement some HMSs adopting the proposed architectures on a Virtex-7 FPGA. The evaluation shows that an HMS of E = 32 operates at 311MHz and achieves 3.13x higher throughput than the state-of-the-art HMS.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114998600","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bonded Force Computations on FPGAs","authors":"Qingqing Xiong, M. Herbordt","doi":"10.1109/FCCM.2017.49","DOIUrl":"https://doi.org/10.1109/FCCM.2017.49","url":null,"abstract":"While acceleration of Molecular Dynamics has received much attention, a significant part of that application, the bonded force calculation, has not. We present what we believe to be the first description and analysis of bonded force calculations outside of ASICs. We characterize the computational requirements. We find that a naive direct implementation requires FPGA resources out of proportion with its proportion of the workload. We investigate other options including various softcores and speed/area tradeoffs. These result in an assortment of solutions optimal for various combinations of problem and cluster size.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115231388","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FPGA-Based Real-Time Charged Particle Trajectory Reconstruction at the Large Hadron Collider","authors":"E. Bartz, J. Chaves, Y. Gershtein, E. Halkiadakis, M. Hildreth, S. Kyriacou, K. Lannon, A. Lefeld, A. Ryd, L. Skinnari, R. Stone, C. Strohman, Z. Tao, B. Winer, P. Wittich, Zhiru Zhang, M. Zientek","doi":"10.1109/FCCM.2017.27","DOIUrl":"https://doi.org/10.1109/FCCM.2017.27","url":null,"abstract":"The upgrades of the Compact Muon Solenoid particle physics experiment at CERN's Large Hadron Collider provide a major challenge for the real-time collision data selection. This paper presents a novel approach to pattern recognition and charged particle trajectory reconstruction using an all-FPGA solution. The challenges include a large input data rate of about 20 to 40 Tbps, processing a new batch of input data every 25 ns, each consisting of about 10,000 precise position measurements of particles ('stubs'), performing the pattern recognition on these stubs to find the trajectories, and producing the list of parameters describing these trajectories within 4 μs. A proposed solution to this problem is described, in particular, the implementation of the pattern recognition and particle trajectory determination using an all-FPGA system. The results of an end-to-end demonstrator system based on Xilinx Virtex-7 FPGAs that meets timing and performance requirements are presented.","PeriodicalId":124631,"journal":{"name":"2017 IEEE 25th Annual International Symposium on Field-Programmable Custom Computing Machines (FCCM)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-03-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114227380","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}