ANDARE '17 | Pub Date: 2017-09-09 | DOI: 10.1145/3152821.3152879
Title: Auto-tuning Static Schedules for Task Data-flow Applications
Authors: Andreas Diavastos, P. Trancoso
Abstract: Scheduling task-based parallel applications on many-core processors is becoming more challenging and has received significant attention recently. The main challenge is to efficiently map tasks onto the underlying hardware topology, using application characteristics such as the dependences between tasks, in order to satisfy performance requirements. To achieve this, each application must be studied exhaustively to determine how its tasks use data, providing the knowledge needed to map tasks that share the same data close to each other. In addition, different hardware topologies require different mappings of the same application to achieve the best performance.
In this work we use the synchronization graph of a task-based parallel application, produced during compilation, to automatically tune the scheduling policy for the underlying hardware using heuristic Genetic Algorithm techniques. The tool is integrated into an actual task-based parallel programming platform called SWITCHES and is evaluated using real applications from the SWITCHES benchmark suite. We compare our results against the execution times of predefined schedules within SWITCHES and observe that the tool converges close to an optimal solution with no effort from the user, while using fewer resources.
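The abstract does not give the paper's encoding, but the core idea, searching task-to-core mappings of a synchronization graph with a Genetic Algorithm, can be sketched on a toy example. Everything below (the graph, the cost model combining cut communication weight with a load-imbalance penalty, and the GA parameters) is a hypothetical illustration, not the SWITCHES tool:

```python
import random

# Hypothetical synchronization graph: (task_a, task_b, shared-data weight).
EDGES = [(0, 1, 4), (1, 2, 4), (0, 3, 2), (3, 4, 2), (2, 4, 1)]
NUM_TASKS, NUM_CORES = 5, 2

def cost(mapping):
    # Communication cost: data shared by tasks placed on different cores,
    # plus a penalty for uneven load across cores.
    cut = sum(w for a, b, w in EDGES if mapping[a] != mapping[b])
    loads = [mapping.count(c) for c in range(NUM_CORES)]
    return cut + 2 * (max(loads) - min(loads))

def evolve(pop_size=20, generations=50, mutation_rate=0.1, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randrange(NUM_CORES) for _ in range(NUM_TASKS)]
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=cost)
        survivors = pop[: pop_size // 2]        # elitist selection
        children = []
        while len(survivors) + len(children) < pop_size:
            p1, p2 = rng.sample(survivors, 2)
            cut_pt = rng.randrange(1, NUM_TASKS)  # one-point crossover
            child = p1[:cut_pt] + p2[cut_pt:]
            for i in range(NUM_TASKS):            # random mutation
                if rng.random() < mutation_rate:
                    child[i] = rng.randrange(NUM_CORES)
            children.append(child)
        pop = survivors + children
    return min(pop, key=cost)

best = evolve()
```

On this tiny graph the optimum places tasks 0-2 on one core and 3-4 on the other; the point of the heuristic is that no exhaustive study of the application by the user is required.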
ANDARE '17 | Pub Date: 2017-09-09 | DOI: 10.1145/3152821.3152877
Title: Autotuning of OpenCL Kernels with Global Optimizations
Authors: J. Filipovič, Filip Petrovic, S. Benkner
Abstract: Autotuning is an important method for automatically exploring code optimizations. It may target low-level code optimizations, such as memory blocking, loop unrolling or memory prefetching, as well as high-level optimizations, such as placing computation kernels on appropriate hardware devices or optimizing memory transfers between nodes or between accelerators and main memory.
In this paper, we introduce an autotuning method which extends state-of-the-art low-level tuning of OpenCL or CUDA kernels towards more complex optimizations. More precisely, we introduce a Kernel Tuning Toolkit (KTT), which implements inter-kernel global optimizations, allowing parameters that affect multiple kernels, or also the host code, to be tuned together. We demonstrate on practical examples that global kernel optimizations let us explore tuning options that are not reachable when kernels are tuned separately. Moreover, our tuning strategies can take numerical accuracy across multiple kernel invocations into account and search for implementations within specific numerical error bounds.
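A toy model can illustrate why inter-kernel (global) tuning reaches configurations that per-kernel tuning cannot. The kernels and cost numbers below are invented for illustration and are not the actual KTT API; the point is only that a parameter shared through the host code (here a data layout) couples the kernels' optima:

```python
from itertools import product

# Synthetic cost model: two kernels share a data layout chosen once by the
# host code; each also has a private tile-size parameter.
LAYOUTS = ["row", "col"]
TILES = [8, 16, 32]

def kernel_a(layout, tile):
    # Hypothetical runtime: fastest with column layout and tile 16.
    return (1.0 if layout == "col" else 2.0) + abs(tile - 16) * 0.01

def kernel_b(layout, tile):
    # Hypothetical runtime: the best tile size depends on the shared layout.
    best_tile = 32 if layout == "row" else 8
    return (1.0 if layout == "row" else 1.2) + abs(tile - best_tile) * 0.01

def tune_separately():
    # Each kernel picks its own optimum -- but the host can use only one
    # layout, so kernel B is forced onto kernel A's choice with a tile size
    # that was tuned under the wrong layout.
    la, ta = min(product(LAYOUTS, TILES), key=lambda p: kernel_a(*p))
    lb, tb = min(product(LAYOUTS, TILES), key=lambda p: kernel_b(*p))
    return kernel_a(la, ta) + kernel_b(la, tb)

def tune_globally():
    # One shared layout parameter, tuned jointly across both kernels.
    return min(kernel_a(l, ta) + kernel_b(l, tb)
               for l, ta, tb in product(LAYOUTS, TILES, TILES))
```

Global tuning finds the jointly optimal (layout, tile, tile) triple, while separate tuning pays for the mismatched layout assumption.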
ANDARE '17 | Pub Date: 2017-09-09 | DOI: 10.1145/3152821.3152822
Title: Adaptive Performance Sensitivity Model to Support GPU Power Management
Authors: Francesco Paterna, U. Gupta, R. Ayoub, Ümit Y. Ogras, M. Kishinevsky
Abstract: Integrated graphics units consume a large portion of power in client and mobile systems. Proactive power-management algorithms have been devised to meet expected user experience while reducing energy consumption. These techniques often rely on power and performance sensitivity models constructed at design time from a set of workloads; however, the lack of representative workloads and the model-identification overhead adversely impact accuracy and development time, respectively. Conversely, two main challenges limit post-design identification at runtime: the absence of sensitivity feedback from the system and the limited computational resources. We propose a two-stage methodology that first identifies the features of the sensitivity model offline, using a reduced amount of training data, and then uses a recursive least-squares algorithm to fit and adapt the model coefficients to workload changes at runtime. The proposed adaptive approach reduces offline training data by 50% with respect to full offline model identification while maintaining an average accuracy of 95%.
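The second stage, adapting coefficients online, can be sketched with a plain recursive least-squares (RLS) update, which is cheap enough for runtime use. The two-feature model, forgetting factor, and samples below are hypothetical stand-ins; the paper's actual features are not given here:

```python
# Recursive least-squares update for a linear sensitivity model y ~ theta.x,
# with two features; 2x2 matrices kept as nested lists (no numpy needed).
def rls_update(theta, P, x, y, lam=0.99):
    # Gain: K = P x / (lam + x^T P x)
    Px = [P[0][0]*x[0] + P[0][1]*x[1], P[1][0]*x[0] + P[1][1]*x[1]]
    denom = lam + x[0]*Px[0] + x[1]*Px[1]
    K = [Px[0]/denom, Px[1]/denom]
    # Coefficient update driven by the prediction error.
    err = y - (theta[0]*x[0] + theta[1]*x[1])
    theta = [theta[0] + K[0]*err, theta[1] + K[1]*err]
    # Covariance update: P = (P - K x^T P) / lam, with forgetting factor lam.
    xP = [x[0]*P[0][0] + x[1]*P[1][0], x[0]*P[0][1] + x[1]*P[1][1]]
    P = [[(P[i][j] - K[i]*xP[j]) / lam for j in range(2)] for i in range(2)]
    return theta, P

# Recover a known model y = 3*f1 + 0.5*f2 from noiseless samples.
theta, P = [0.0, 0.0], [[1000.0, 0.0], [0.0, 1000.0]]
samples = [((1.0, 0.0), 3.0), ((0.0, 1.0), 0.5), ((1.0, 1.0), 3.5),
           ((2.0, 1.0), 6.5), ((1.0, 2.0), 4.0)]
for x, y in samples:
    theta, P = rls_update(theta, P, x, y)
```

The forgetting factor below 1 is what lets the coefficients track workload changes rather than average over the whole history.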
ANDARE '17 | Pub Date: 2017-09-09 | DOI: 10.1145/3152821.3152878
Title: Benefits in Relaxing the Power Capping Constraint
Authors: Daniel Cesarini, Andrea Bartolini, L. Benini
Abstract: In this manuscript we evaluate the impact of hardware power-capping mechanisms on a real parallel scientific application. By comparing the hardware capping mechanism against static frequency-allocation schemes, we show that a speedup can be achieved if the power constraint is enforced on average over the application run instead of over short time periods. RAPL, which enforces the power constraint on a time scale of a few milliseconds, fails to share the power budget between more demanding and less demanding application phases.
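The benefit of relaxing the cap to an average constraint can be seen in a small numeric sketch. All power and runtime numbers below are invented for illustration: a compute-bound phase profits from high frequency while a memory-bound phase does not, so an average budget can spend the headroom left over by the memory phase on the compute phase:

```python
# Toy model: an application alternates between a compute-bound and a
# memory-bound phase. Power draw per phase and frequency is hypothetical.
FREQS = [2.0, 2.5, 3.0]                         # GHz
POWER = {
    "compute": {2.0: 80, 2.5: 100, 3.0: 125},   # watts
    "memory":  {2.0: 60, 2.5: 70,  3.0: 80},
}

def runtime(phase, f):
    # Only the compute phase scales with frequency; the memory-bound
    # phase takes 10 s regardless of frequency in this model.
    base = 10.0  # seconds at 2.0 GHz
    return base * (2.0 / f) if phase == "compute" else base

CAP = 90.0  # watts

def per_interval_cap():
    # RAPL-like: every phase must respect the cap on its own, so the
    # compute phase is stuck at the lowest frequency.
    total = 0.0
    for phase in ("compute", "memory"):
        f = max(f for f in FREQS if POWER[phase][f] <= CAP)
        total += runtime(phase, f)
    return total

def average_cap():
    # Budget enforced over the whole run: pick one frequency per phase so
    # the time-weighted average power stays under the cap.
    best = None
    for fc in FREQS:
        for fm in FREQS:
            tc, tm = runtime("compute", fc), runtime("memory", fm)
            avg = (POWER["compute"][fc]*tc + POWER["memory"][fm]*tm) / (tc + tm)
            if avg <= CAP and (best is None or tc + tm < best):
                best = tc + tm
    return best
```

Under the per-interval cap the compute phase runs at 2.0 GHz and the application takes 20 s; under the average cap it can run at 3.0 GHz (paired with the memory phase at 2.0 GHz) and finishes in about 16.7 s while the mean power stays within budget.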
ANDARE '17 | Pub Date: 2017-09-09 | DOI: 10.1145/3152821.3152880
Title: Exploiting Parallelism on GPUs and FPGAs with OmpSs
Authors: Jaume Bosch, Antonio Filgueras, Miquel Vidal Piñol, Daniel Jiménez-González, C. Álvarez, X. Martorell
Abstract: This paper presents the OmpSs approach to heterogeneous programming on GPU and FPGA accelerators. The OmpSs programming model is based on the Mercurium compiler and the Nanos++ runtime. Applications are annotated with compiler directives specifying task-based parallelism. The Mercurium compiler transforms the code to exploit parallelism on the SMP host cores and to spawn work on CUDA/OpenCL devices and FPGA accelerators. For CUDA/OpenCL devices, the programmer only needs to insert the annotations and provide the kernel function to be compiled by the native CUDA/OpenCL compiler. For FPGAs, OmpSs uses the High-Level Synthesis tools from FPGA vendors to generate the IP configurations for the FPGA. We present the performance obtained with OmpSs on the matrix-multiply benchmark on the Xilinx Zynq UltraScale+.