ANDARE '18 | Pub Date: 2018-11-04 | DOI: 10.1145/3295816.3295817
D. Střelák, J. Filipovič
Performance analysis and autotuning setup of the cuFFT library

Abstract: The fast Fourier transform (FFT) has many applications. It is often one of the most computationally demanding kernels, so much attention has been invested in tuning its performance on various hardware devices. However, FFT libraries usually have many possible settings, and it is not always easy to deduce which settings yield optimal performance. In practice, the FFT settings can often be slightly modified, for example by padding or cropping the input data. Surprisingly, the majority of state-of-the-art papers focus on how to implement the FFT under given settings, but pay little attention to which settings result in the fastest computation.

In this paper, we target a popular FFT implementation for GPU accelerators, the cuFFT library. We analyze the behavior and performance of the cuFFT library with respect to input sizes and plan settings. We also present a new tool, cuFFTAdvisor, which proposes, and by means of autotuning finds, the best configuration of the library for given constraints on input size and plan settings. We experimentally show that our tool is able to propose different settings of the transformation, resulting in an average 6x speedup using fast heuristics and a 6.9x speedup using autotuning.
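The pad-or-crop advice can be illustrated with a minimal sketch, under the common assumption (not a claim about cuFFTAdvisor's actual heuristics, which also weigh plan settings) that sizes whose prime factorizations contain only small primes transform fastest:

```python
def next_smooth_size(n, primes=(2, 3, 5, 7)):
    """Smallest m >= n whose prime factors all lie in `primes`.

    FFT libraries such as cuFFT are typically fastest for sizes that
    factor into small primes, so padding an input up to such a size can
    beat transforming the original size directly.
    """
    m = n
    while True:
        k = m
        for p in primes:
            while k % p == 0:
                k //= p
        if k == 1:       # m was fully factored by small primes
            return m
        m += 1

# e.g. a 17-element signal would be padded to 18 (= 2 * 3^2)
```

The function name and the choice of primes {2, 3, 5, 7} are illustrative; the trade-off between padding cost and transform speedup is exactly what an autotuner would measure rather than guess.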
ANDARE '18 | Pub Date: 2018-11-04 | DOI: 10.1145/3295816.3295818
Daniel Cesarini, Andrea Bartolini, P. Bonfà, C. Cavazzoni, L. Benini
COUNTDOWN: a run-time library for application-agnostic energy saving in MPI communication primitives

Abstract: Energy and power consumption are prominent issues in today's supercomputers and are foreseen as a limiting factor of future installations. In scientific computing, a significant amount of power is spent in communication- and synchronization-related idle times among distributed processes participating in the same application. However, due to the time scale at which communication happens, exploiting low-power states to reduce power during these idle times may introduce significant overheads.

In this paper we present COUNTDOWN, a methodology and a tool for identifying communication and synchronization primitives and automatically reducing the frequency of the computing elements during them in order to save energy. COUNTDOWN is able to filter out phases in which frequency scaling would harm the application's time-to-solution, transparently to the user, without touching the application code and without requiring recompilation. We tested our methodology on a production Tier-0 system with a production application, Quantum ESPRESSO (QE), and production datasets scaling up to 3.5K cores. Experimental results show that our methodology saves 22.36% of energy consumption with a performance penalty of 2.88% in a real production MPI-based application.
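The filtering idea described above can be sketched as follows; the 500 μs threshold, the function name, and the assumption that phase durations are observable per MPI call are all illustrative, not values or interfaces from the paper:

```python
# Hypothetical threshold: waits shorter than this are "filtered out"
# because entering a low-power state would cost more than it saves.
TIMEOUT_S = 0.0005

def phases_to_downscale(wait_times):
    """Return indices of MPI wait phases long enough that lowering the
    core frequency should save energy without hurting time-to-solution.

    wait_times: measured durations (seconds) of successive communication
    or synchronization phases of one MPI rank.
    """
    return [i for i, t in enumerate(wait_times) if t > TIMEOUT_S]

# Short point-to-point waits are skipped; long collective waits qualify:
waits = [0.0001, 0.003, 0.00002, 0.012]
candidates = phases_to_downscale(waits)
```

A real interposition library would make this decision online (e.g. by arming a timer on entry to the primitive) rather than from recorded durations, but the filtering criterion is the same.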
ANDARE '18 | Pub Date: 2018-11-04 | DOI: 10.1145/3295816.3295820
Alexandre Vieira, F. Pratas, L. Sousa, A. Ilic
Accelerating CNN computation: quantisation tuning and network resizing

Abstract: Interest in developing cognition-aware systems, especially for vision applications based on artificial neural networks, has grown rapidly in recent years. While high-performance systems are key to the success of current Convolutional Neural Network (CNN) implementations, there is a trend towards bringing these capabilities to embedded real-time systems. This work contributes to tackling this challenge by exploring the CNN design space. Namely, it combines parameter quantisation techniques with a proposed set of CNN architectural transformations to reduce resource and execution-time costs on Field Programmable Gate Array (FPGA) devices while maintaining high classification accuracy. A hardware mapping methodology is also proposed for deploying resource-constrained CNNs on a reconfigurable platform for efficient algorithm acceleration. The proposed transformations reduce the accuracy loss due to quantisation by 44% on average. Furthermore, analysis of the performance results obtained on a Central Processing Unit (CPU)+FPGA platform shows up to 50% execution-time reduction compared with a state-of-the-art implementation.
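As a hedged illustration of the quantisation step being tuned (the exact scheme, bit widths, and rounding policy used in the paper are not specified in this abstract), a per-tensor uniform symmetric quantiser might look like:

```python
def quantise(weights, bits):
    """Uniform symmetric quantisation of a weight list to `bits` bits.

    Values are mapped to integers in [-2^(bits-1), 2^(bits-1) - 1] using
    a single per-tensor scale, then mapped back to floats. Comparing the
    result against the original weights is the usual way to estimate the
    accuracy loss a fixed-point FPGA datapath would incur.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    # Round to the nearest representable integer, clamping to the range.
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [v * scale for v in q]

dequantised = quantise([1.0, -1.0, 0.5], 8)  # 8-bit fixed point
```

Architectural transformations that shrink the dynamic range of each layer's weights reduce `scale`, which is one way such transformations can cut quantisation error, consistent with the 44% average reduction reported above.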
ANDARE '18 | Pub Date: 2018-11-04 | DOI: 10.1145/3295816.3295821
Andreas Kurth, Alessandro Capotondi, Pirmin Vogel, L. Benini, A. Marongiu
HERO: an open-source research platform for HW/SW exploration of heterogeneous manycore systems

Abstract: Heterogeneous systems on chip (HeSoCs) co-integrate a high-performance multicore host processor with programmable manycore accelerators (PMCAs) to combine "standard platform" software support (e.g., the Linux OS) with energy-efficient, domain-specific, highly parallel processing capabilities.

In this work, we present HERO, a HeSoC research platform that tackles the challenge of hardware/software co-exploration in a novel way. HERO's host processor is an industry-standard ARM Cortex-A multicore complex, while its PMCA is a scalable, silicon-proven, open-source manycore processing engine based on the extensible, open RISC-V ISA.

We evaluate a prototype implementation of HERO, in which the PMCA, implemented on an FPGA fabric, is coupled with a hard ARM Cortex-A host processor, and show that the runtime overhead compared to manually written PMCA code operating on private physical memory is lower than 10% for pivotal benchmarks and operating conditions.
ANDARE '18 | Pub Date: 2018-11-04 | DOI: 10.1145/3295816.3295819
Antonio Libri, Andrea Bartolini, Daniel Cesarini, L. Benini
Evaluation of NTP/PTP fine-grain synchronization performance in HPC clusters

Abstract: Fine-grained time synchronization is important for addressing several challenges in today's and future High Performance Computing (HPC) centers. Among many others, (i) co-scheduling techniques for parallel applications with sensitive bulk-synchronous workloads, (ii) performance analysis tools, and (iii) autotuning strategies that want to exploit state-of-the-art (SoA) high-resolution monitoring systems are three examples where synchronization within a few microseconds is required. Previous works report custom solutions that reach this performance without incurring the extra cost of dedicated hardware. On the other hand, the benefits of using robust standards that are widely supported by the community, such as the Network Time Protocol (NTP) and the Precision Time Protocol (PTP), are evident. With today's software and hardware improvements to these two protocols and their off-the-shelf integration in SoA HPC servers, no expensive extra hardware is required anymore, but an evaluation of their performance in supercomputing clusters is needed. Our results show that NTP can reach an accuracy of 2.6 μs and a precision below 2.7 μs on computing nodes, with negligible overhead. These values can be bounded below a microsecond with PTP and low-cost switches (no need for a GPS antenna). Both protocols are also suitable for data time-stamping in SoA HPC monitoring infrastructures. We validate their performance with two real use cases, and quantify scalability and CPU overhead. Finally, we report the software settings and low-cost network configuration needed to reach these high-precision synchronization results.
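The accuracy and precision figures above ultimately rest on the standard four-timestamp exchange that NTP (and, in analogous form, PTP) uses to estimate clock offset. A minimal sketch of that arithmetic, with illustrative timestamps:

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Classic NTP on-wire offset/delay computation.

    t1: client send time, t4: client receive time (client clock);
    t2: server receive time, t3: server send time (server clock).
    The offset estimate assumes a symmetric network path, which is why
    low-cost switches with stable, symmetric latency help precision.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0  # client clock error vs. server
    delay = (t4 - t1) - (t3 - t2)           # round-trip network delay
    return offset, delay

# Client clock 5 units behind the server, 2-unit one-way latency:
off, rtt = ntp_offset_delay(100.0, 107.0, 107.0, 104.0)
```

Path asymmetry translates directly into offset error, which is one reason hardware timestamping (as in PTP) and a quiet, low-cost dedicated network matter for sub-microsecond results.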