ANDARE '18 | Pub Date: 2018-11-04 | DOI: 10.1145/3295816.3295817
D. Střelák, J. Filipovič
Performance analysis and autotuning setup of the cuFFT library

Abstract: The fast Fourier transform (FFT) has many applications. It is often one of the most computationally demanding kernels, so much attention has been invested in tuning its performance on various hardware devices. However, FFT libraries usually have many possible settings, and it is not always easy to deduce which settings yield optimal performance. In practice, the FFT settings can often be slightly modified, for example by padding or cropping the input data. Surprisingly, the majority of state-of-the-art papers focus on how to implement the FFT under given settings, but pay little attention to which settings result in the fastest computation.

In this paper, we target a popular FFT implementation for GPU accelerators, the cuFFT library. We analyze the behavior and performance of the cuFFT library with respect to input sizes and plan settings. We also present a new tool, cuFFTAdvisor, which proposes, and by means of autotuning finds, the best configuration of the library for given constraints on input size and plan settings. We experimentally show that our tool is able to propose different settings of the transformation, resulting in an average 6x speedup using fast heuristics and a 6.9x speedup using autotuning.
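The pad-or-crop advice can be illustrated with a minimal sketch, under the common assumption (not a claim about cuFFTAdvisor's actual heuristics, which also weigh plan settings) that sizes whose prime factorizations contain only small primes transform fastest:

```python
def next_smooth_size(n, primes=(2, 3, 5, 7)):
    """Smallest m >= n whose prime factors all lie in `primes`.

    FFT libraries such as cuFFT are typically fastest for sizes that
    factor into small primes, so padding an input up to such a size can
    beat transforming the original size directly.
    """
    m = n
    while True:
        k = m
        for p in primes:
            while k % p == 0:
                k //= p
        if k == 1:       # m was fully factored by small primes
            return m
        m += 1

# e.g. a 17-element signal would be padded to 18 (= 2 * 3^2)
```

The function name and the choice of primes {2, 3, 5, 7} are illustrative; the trade-off between padding cost and transform speedup is exactly what an autotuner would measure rather than guess.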
ANDARE '18 | Pub Date: 2018-11-04 | DOI: 10.1145/3295816.3295818
Daniel Cesarini, Andrea Bartolini, P. Bonfà, C. Cavazzoni, L. Benini
COUNTDOWN: a run-time library for application-agnostic energy saving in MPI communication primitives

Abstract: Energy and power consumption are prominent issues in today's supercomputers and are foreseen as a limiting factor of future installations. In scientific computing, a significant amount of power is spent in communication- and synchronization-related idle times among distributed processes participating in the same application. However, due to the time scale at which communication happens, exploiting low-power states to reduce power during these idle times may introduce significant overheads.

In this paper we present COUNTDOWN, a methodology and a tool for identifying communication and synchronization primitives and automatically reducing the frequency of the computing elements during them in order to save energy. COUNTDOWN is able to filter out phases in which frequency scaling would harm the application's time-to-solution, transparently to the user, without touching the application code and without requiring recompilation. We tested our methodology on a production Tier-0 system with a production application, Quantum ESPRESSO (QE), and production datasets scaling up to 3.5K cores. Experimental results show that our methodology saves 22.36% of energy consumption with a performance penalty of 2.88% in a real production MPI-based application.
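The filtering idea described above can be sketched as follows; the 500 μs threshold, the function name, and the assumption that phase durations are observable per MPI call are all illustrative, not values or interfaces from the paper:

```python
# Hypothetical threshold: waits shorter than this are "filtered out"
# because entering a low-power state would cost more than it saves.
TIMEOUT_S = 0.0005

def phases_to_downscale(wait_times):
    """Return indices of MPI wait phases long enough that lowering the
    core frequency should save energy without hurting time-to-solution.

    wait_times: measured durations (seconds) of successive communication
    or synchronization phases of one MPI rank.
    """
    return [i for i, t in enumerate(wait_times) if t > TIMEOUT_S]

# Short point-to-point waits are skipped; long collective waits qualify:
waits = [0.0001, 0.003, 0.00002, 0.012]
candidates = phases_to_downscale(waits)
```

A real interposition library would make this decision online (e.g. by arming a timer on entry to the primitive) rather than from recorded durations, but the filtering criterion is the same.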
ANDARE '18 | Pub Date: 2018-11-04 | DOI: 10.1145/3295816.3295820
Alexandre Vieira, F. Pratas, L. Sousa, A. Ilic
Accelerating CNN computation: quantisation tuning and network resizing

Abstract: Interest in developing cognition-aware systems, especially for vision applications based on artificial neural networks, has grown rapidly in recent years. While high-performance systems are key to the success of current Convolutional Neural Network (CNN) implementations, there is a trend towards bringing these capabilities to embedded real-time systems. This work contributes to tackling this challenge by exploring the CNN design space. Namely, it combines parameter quantisation techniques with a proposed set of CNN architectural transformations to reduce resource and execution-time costs on Field Programmable Gate Array (FPGA) devices while maintaining high classification accuracy. A hardware mapping methodology is also proposed for deploying resource-constrained CNNs on a reconfigurable platform for efficient algorithm acceleration. The proposed transformations reduce the accuracy loss due to quantisation by 44% on average. Furthermore, analysis of the performance results obtained on a Central Processing Unit (CPU)+FPGA platform shows up to 50% execution-time reduction compared with a state-of-the-art implementation.
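As a hedged illustration of the quantisation step being tuned (the exact scheme, bit widths, and rounding policy used in the paper are not specified in this abstract), a per-tensor uniform symmetric quantiser might look like:

```python
def quantise(weights, bits):
    """Uniform symmetric quantisation of a weight list to `bits` bits.

    Values are mapped to integers in [-2^(bits-1), 2^(bits-1) - 1] using
    a single per-tensor scale, then mapped back to floats. Comparing the
    result against the original weights is the usual way to estimate the
    accuracy loss a fixed-point FPGA datapath would incur.
    """
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(w) for w in weights) / qmax
    # Round to the nearest representable integer, clamping to the range.
    q = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return [v * scale for v in q]

dequantised = quantise([1.0, -1.0, 0.5], 8)  # 8-bit fixed point
```

Architectural transformations that shrink the dynamic range of each layer's weights reduce `scale`, which is one way such transformations can cut quantisation error, consistent with the 44% average reduction reported above.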
ANDARE '18 | Pub Date: 2018-11-04 | DOI: 10.1145/3295816.3295821
Andreas Kurth, Alessandro Capotondi, Pirmin Vogel, L. Benini, A. Marongiu
HERO: an open-source research platform for HW/SW exploration of heterogeneous manycore systems

Abstract: Heterogeneous systems on chip (HeSoCs) co-integrate a high-performance multicore host processor with programmable manycore accelerators (PMCAs) to combine "standard platform" software support (e.g., the Linux OS) with energy-efficient, domain-specific, highly parallel processing capabilities.

In this work, we present HERO, a HeSoC research platform that tackles the challenge of hardware/software co-exploration in a novel way. HERO's host processor is an industry-standard ARM Cortex-A multicore complex, while its PMCA is a scalable, silicon-proven, open-source manycore processing engine based on the extensible, open RISC-V ISA.

We evaluate a prototype implementation of HERO, in which the PMCA, implemented on an FPGA fabric, is coupled with a hard ARM Cortex-A host processor, and show that the runtime overhead compared to manually written PMCA code operating on private physical memory is lower than 10% for pivotal benchmarks and operating conditions.
ANDARE '18 | Pub Date: 2018-11-04 | DOI: 10.1145/3295816.3295819
Antonio Libri, Andrea Bartolini, Daniel Cesarini, L. Benini
Evaluation of NTP/PTP fine-grain synchronization performance in HPC clusters

Abstract: Fine-grained time synchronization is important for addressing several challenges in today's and future High Performance Computing (HPC) centers. Among many others, (i) co-scheduling techniques for parallel applications with sensitive bulk-synchronous workloads, (ii) performance analysis tools, and (iii) autotuning strategies that want to exploit state-of-the-art (SoA) high-resolution monitoring systems are three examples where synchronization within a few microseconds is required. Previous works report custom solutions that reach this performance without incurring the extra cost of dedicated hardware. On the other hand, the benefits of using robust standards that are widely supported by the community, such as the Network Time Protocol (NTP) and the Precision Time Protocol (PTP), are evident. With today's software and hardware improvements to these two protocols and their off-the-shelf integration in SoA HPC servers, no expensive extra hardware is required anymore, but an evaluation of their performance in supercomputing clusters is needed. Our results show that NTP can reach an accuracy of 2.6 μs and a precision below 2.7 μs on computing nodes, with negligible overhead. These values can be bounded below a microsecond with PTP and low-cost switches (no need for a GPS antenna). Both protocols are also suitable for data time-stamping in SoA HPC monitoring infrastructures. We validate their performance with two real use cases, and quantify scalability and CPU overhead. Finally, we report the software settings and low-cost network configuration needed to reach these high-precision synchronization results.
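The accuracy and precision figures above ultimately rest on the standard four-timestamp exchange that NTP (and, in analogous form, PTP) uses to estimate clock offset. A minimal sketch of that arithmetic, with illustrative timestamps:

```python
def ntp_offset_delay(t1, t2, t3, t4):
    """Classic NTP on-wire offset/delay computation.

    t1: client send time, t4: client receive time (client clock);
    t2: server receive time, t3: server send time (server clock).
    The offset estimate assumes a symmetric network path, which is why
    low-cost switches with stable, symmetric latency help precision.
    """
    offset = ((t2 - t1) + (t3 - t4)) / 2.0  # client clock error vs. server
    delay = (t4 - t1) - (t3 - t2)           # round-trip network delay
    return offset, delay

# Client clock 5 units behind the server, 2-unit one-way latency:
off, rtt = ntp_offset_delay(100.0, 107.0, 107.0, 104.0)
```

Path asymmetry translates directly into offset error, which is one reason hardware timestamping (as in PTP) and a quiet, low-cost dedicated network matter for sub-microsecond results.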