{"title":"Session details: Session 8: Applications 2","authors":"Lesley Shannon","doi":"10.1145/3252943","DOIUrl":"https://doi.org/10.1145/3252943","url":null,"abstract":"","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124130927","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Memory-Efficient Fast Fourier Transform on Streaming Data by Fusing Permutations","authors":"F. Serre, Markus Püschel","doi":"10.1145/3174243.3174263","DOIUrl":"https://doi.org/10.1145/3174243.3174263","url":null,"abstract":"We propose a novel FFT datapath that reduces the memory requirement compared to state-of-the-art RAM-based implementations by up to a factor of two. The novelty is in a technique to fuse the datapaths for the required perfect shuffle and bit reversal and is applicable to an entire design space of FFT implementations with varying degrees of reuse and number of input ports. We implemented a tool to generate this FFT design space for a given input size and to benchmark against prior work. The results show a reduction of half the RAM banks and/or half the logic complexity used for the permutations. The technique for fusing permutations is more generally applicable beyond the FFT.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122992018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A HOG-based Real-time and Multi-scale Pedestrian Detector Demonstration System on FPGA","authors":"Jan Dürre, Dario Paradzik, H. Blume","doi":"10.1145/3174243.3174249","DOIUrl":"https://doi.org/10.1145/3174243.3174249","url":null,"abstract":"Pedestrian detection will play a major role in future driver assistance and autonomous driving. One powerful algorithm in this field uses HOG features to describe the specific properties of pedestrians in images. To determine their locations, features are extracted and classified window-wise from different scales of an input image. The classification results are finally merged to remove overlapping detections. Real-time execution of this method requires specific FPGA or ASIC architectures. Recent work has focused on accelerating the feature extraction and classification. Although merging is an important step in the algorithm, it is only rarely considered in hardware implementations; a likely reason is its complexity and irregularity, which are not trivial to implement in hardware. In this paper, we present a new bottom-up FPGA architecture that maps the full HOG-based pedestrian detection algorithm, including feature extraction, SVM classification, and multi-scale processing in combination with merging. For that purpose, we also propose a new hardware-optimized merging method. The resulting architecture is highly efficient. Additionally, we present an FPGA-based full real-time and multi-scale pedestrian detection demonstration system.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129004141","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"K-Flow: A Programming and Scheduling Framework to Optimize Dataflow Execution on CPU-FPGA Platforms: (Abstract Only)","authors":"J. Cong, Zhenman Fang, Yao Hu, Di Wu","doi":"10.1145/3174243.3174968","DOIUrl":"https://doi.org/10.1145/3174243.3174968","url":null,"abstract":"With the slowing down of Moore's law, major cloud service providers---such as Amazon Web Services, Microsoft Azure, and Alibaba Cloud---have all started deploying FPGAs in their cloud platforms to improve performance and energy efficiency. From the perspective of performance per unit cost in the cloud, it is essential to efficiently utilize all available CPU and FPGA resources within a requested computing instance. However, most prior studies overlook CPU-FPGA co-optimization or require considerable manual effort to achieve it. In this poster, we present a framework called K-Flow, which enables easy FPGA accelerator integration and efficient CPU-FPGA co-scheduling for big data applications. K-Flow abstracts an application as a directed acyclic graph (DAG), and dynamically schedules a number of CPU threads and/or FPGA accelerator processing elements (PEs) to execute the dataflow tasks on each DAG node. Moreover, K-Flow provides user-friendly interfaces to program each DAG node and automates the tedious process of FPGA accelerator integration and CPU-FPGA co-optimization. Using the genomic read alignment application BWA-MEM as a case study, experimental results show that K-Flow achieves a throughput that is on average 94.5% of the theoretical upper bound and 1.4x better than a straightforward FPGA integration.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126690864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Degree-aware Hybrid Graph Traversal on FPGA-HMC Platform","authors":"Jialiang Zhang, J. Li","doi":"10.1145/3174243.3174245","DOIUrl":"https://doi.org/10.1145/3174243.3174245","url":null,"abstract":"Graph traversal is a core primitive for graph analytics and a basis for many higher-level graph analysis methods. However, irregularities in the structure of scale-free graphs (e.g., social networks) limit our ability to analyze these important and growing datasets. A key challenge is the redundant graph computation caused by the presence of high-degree vertices, which not only increases the total amount of computation but also incurs unnecessary random data accesses. In this paper, we present a graph processing system on an FPGA-HMC platform, based on software/hardware co-design and co-optimization. For the first time, we leverage an inherent graph property, vertex degree, to co-optimize the algorithm and the hardware architecture. In particular, we first develop two algorithm optimization techniques: degree-aware adjacency list reordering and degree-aware vertex index sorting. The former reduces the number of redundant graph computations, while the latter creates a strong correlation between vertex index and data access frequency, which can be effectively applied to guide the hardware design. We further implement the optimized hybrid graph traversal algorithm on an FPGA-HMC platform. By leveraging the strong correlation between vertex index and data access frequency created by degree-aware vertex index sorting, we develop two platform-dependent hardware optimization techniques, namely degree-aware data placement and degree-aware adjacency list compression. Together, these two techniques substantially reduce the amount of external memory access. Finally, we conduct extensive experiments on an FPGA-HMC platform to verify the effectiveness of the proposed techniques. To the best of our knowledge, our implementation achieves the highest performance (45.8 billion traversed edges per second) among existing FPGA-based graph processing systems.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"76 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126261862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FastTrack: Exploiting Fast FPGA Wiring for Implementing NoC Shortcuts (Abstract Only)","authors":"Nachiket Kapre, T. Krishna","doi":"10.1145/3174243.3174962","DOIUrl":"https://doi.org/10.1145/3174243.3174962","url":null,"abstract":"The latency of packet-switched FPGA overlay Networks-on-Chip (NoCs) grows linearly with the NoC dimensions, since packets typically spend a cycle in each dynamic router along the path. High-performance FPGA NoCs have to aggressively pipeline interconnects, thereby adding extra latency overhead to the NoC. The use of FPGA-friendly deflection routing schemes further exacerbates latency. Fortunately, FPGAs provide segmented interconnects with different lengths (speeds), and faster FPGA tracks can be used to reduce the number of switchbox hops along the packet path. We introduce FastTrack, an adaptation of the NoC organization that inserts express bypass links in the NoC to skip multiple router stages in a single clock cycle. Our FastTrack design can be tuned to support different express link lengths for performance, and depopulation strategies for controlling cost. For the Xilinx Virtex-7 485T FPGA, an 8×8 FastTrack NoC is 2× larger than a base Hoplite NoC, but operates at between 1.2× and 0.8× of its clock frequency when using express links of length 2-4. FastTrack delivers throughput and latency improvements across a range of statistical workloads (2-2.5×), and traces extracted from FPGA accelerator case studies such as Sparse Matrix-Vector Multiplication (2.5×), Graph Analytics (2.8×), and Multi-processor overlay applications (2×). FastTrack also shows energy efficiency improvements by factors of up to 2× over baseline Hoplite, due to higher sustained rates and high-speed operation of express links made possible by fast FPGA interconnect.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116489855","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Rosetta: A Realistic High-Level Synthesis Benchmark Suite for Software Programmable FPGAs","authors":"Yuan Zhou, Udit Gupta, Steve Dai, Ritchie Zhao, Nitish Kumar Srivastava, Hanchen Jin, Joseph Featherston, Yi-Hsiang Lai, Gai Liu, Gustavo Angarita Velasquez, Wenping Wang, Zhiru Zhang","doi":"10.1145/3174243.3174255","DOIUrl":"https://doi.org/10.1145/3174243.3174255","url":null,"abstract":"Modern high-level synthesis (HLS) tools greatly reduce the turn-around time of designing and implementing complex FPGA-based accelerators. They also expose various optimization opportunities, which cannot be easily explored at the register-transfer level. With the increasing adoption of the HLS design methodology and continued advances of synthesis optimization, there is a growing need for realistic benchmarks to (1) facilitate comparisons between tools, (2) evaluate and stress-test new synthesis techniques, and (3) establish meaningful performance baselines to track progress of the HLS technology. While several HLS benchmark suites already exist, they are primarily comprised of small textbook-style function kernels, instead of complete and complex applications. To address this limitation, we introduce Rosetta, a realistic benchmark suite for software programmable FPGAs. Designs in Rosetta are fully-developed applications. They are associated with realistic performance constraints, and optimized with advanced features of modern HLS tools. We believe that Rosetta is not only useful for the HLS research community, but can also serve as a set of design tutorials for non-expert HLS users. In this paper we describe the characteristics of our benchmarks and the optimization techniques applied to them. We further report experimental results on an embedded FPGA device as well as a cloud FPGA platform.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"06 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129685489","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"DATuner: An Extensible Distributed Autotuning Framework for FPGA Design and Design Automation: (Abstract Only)","authors":"Gai Liu, Ecenur Ustun, Shaojie Xiang, Chang Xu, Guojie Luo, Zhiru Zhang","doi":"10.1145/3174243.3174978","DOIUrl":"https://doi.org/10.1145/3174243.3174978","url":null,"abstract":"Mainstream FPGA tools contain an extensive set of user-controlled compilation options and internal optimization strategies that significantly impact design quality. These compilation and optimization parameters create a complex design space that human designers may not be able to explore effectively in a time-efficient manner. In this work we describe DATuner, an open-source, extensible, distributed autotuning framework for optimizing FPGA designs and design automation tools using an ensemble of search techniques managed by multi-armed bandit algorithms. DATuner is designed for a distributed environment that uses parallel searches to amortize the significant runtime overhead of the CAD tools. DATuner provides a convenient interface for extension to user-supplied tools, enabling end users to apply DATuner to the design tools and flows of their interest. We demonstrate the effectiveness and extensibility of DATuner using three case studies: clock frequency optimization for FPGA compilation, fixed-point optimization, and autotuning of logic synthesis transformations.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130665304","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"FASTCF: FPGA-based Accelerator for STochastic-Gradient-Descent-based Collaborative Filtering","authors":"Shijie Zhou, R. Kannan, Yu Min, V. Prasanna","doi":"10.1145/3174243.3174252","DOIUrl":"https://doi.org/10.1145/3174243.3174252","url":null,"abstract":"Sparse matrix factorization using Stochastic Gradient Descent (SGD) is a popular technique for deriving latent features from observations. SGD is widely used for Collaborative Filtering (CF), itself a well-known machine learning technique for recommender systems. In this paper, we develop an FPGA-based accelerator, FASTCF, to accelerate the SGD-based CF algorithm. FASTCF consists of parallel, pipelined processing units which concurrently process distinct user ratings by accessing a shared on-chip buffer. We design FASTCF through a holistic analysis of the specific design challenges for the acceleration of SGD-based CF on FPGA. Based on our analysis of these design challenges, we develop a bipartite graph processing approach with a novel 3-level hierarchical partitioning scheme that enables conflict-minimizing scheduling and processing of on-chip feature vector data to significantly accelerate the processing of this bipartite graph. First, we develop a fast heuristic to partition the input graph into induced subgraphs; this enables FASTCF to efficiently buffer vertex data for reuse and completely hide communication overhead. Second, we partition all the edges of each subgraph into matchings to extract the maximum parallelism. Third, we schedule the execution of the edges inside each matching to reduce concurrent memory access conflicts to the shared on-chip buffer. Compared with non-optimized baseline designs, the hierarchical partitioning approach results in up to 60x data dependency reduction, 4.2x bank conflict reduction, and 15.4x speedup. We implement FASTCF on a state-of-the-art FPGA and evaluate its performance using three large real-life datasets. Experimental results show that FASTCF sustains a high throughput of up to 217 billion floating-point operations per second (217 GFLOPS). Compared with state-of-the-art multi-core and GPU implementations, FASTCF demonstrates 13.3x and 12.7x speedup, respectively.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131994211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A FPGA Friendly Approximate Computing Framework with Hybrid Neural Networks: (Abstract Only)","authors":"Haiyue Song, Xiang Song, Tianjian Li, Hao Dong, Naifeng Jing, Xiaoyao Liang, Li Jiang","doi":"10.1145/3174243.3174965","DOIUrl":"https://doi.org/10.1145/3174243.3174965","url":null,"abstract":"Neural approximate computing promises energy efficiency at the cost of tolerable quality loss. The architecture contains two neural networks: an approximate accelerator generates approximate results, while a classifier determines whether input data can be safely approximated. However, such architectures are not compatible with a heterogeneous computing platform, due to the large communication overhead between the approximate accelerator and the accurate cores, and the large speed gap between them. This paper proposes a software-hardware co-design strategy. Through deep exploration of data distributions in the feature space, we first propose a novel approximate computing architecture containing a multi-class classifier and multiple approximate accelerators. This architecture, derived from existing iterative co-training methods, can shift more data from accurate computation (on the CPU) to the approximate accelerator (on the FPGA); the increased invocation of the approximate accelerator yields higher utilization of the FPGA-based accelerator, resulting in enhanced performance. Moreover, much less input data is redirected back to the CPU by the classifier (also on the FPGA), which minimizes CPU-FPGA communication. Second, we design a pipelined datapath with batched input/output for the proposed hybrid architecture to efficiently hide the communication latency. Finally, a mask technique is proposed to decouple the synchronization between CPU and FPGA, in order to minimize the frequency of communication.","PeriodicalId":164936,"journal":{"name":"Proceedings of the 2018 ACM/SIGDA International Symposium on Field-Programmable Gate Arrays","volume":"156 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114476551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}