Title: "Hardware-Aware Automatic Code-Transformation to Support Compilers in Exploiting the Multi-Level Parallel Potential of Modern CPUs"
Authors: Dustin Feld, T. Soddemann, M. Jünger, Sven Mallach
DOI: https://doi.org/10.1145/2723772.2723776
Abstract: Modern compilers offer more and more capabilities to automatically parallelize code regions if these match certain properties. However, several application kernels that would match these properties after rather simple transformations are either not parallelized at all by state-of-the-art compilers or could at least be improved with respect to their performance. This paper proposes a loop-tiling approach focusing on automatic vectorization and multi-core parallelization, with emphasis on smart cache exploitation. The method is based on polyhedral code transformations applied as a pre-compilation step, and it is shown to help compilers generate more and better parallel code regions. It automatically adapts to hardware parameters such as the SIMD register width and cache sizes. Further, it takes memory-access patterns into account and is capable of minimizing communication among tiles that are to be processed by different cores. An extensive computational study shows significant improvements in the number of vectorized instructions, cache miss rates, and running times for a range of application kernels. The method often outperforms the internal auto-parallelization techniques implemented in gcc and icc.
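The tiling idea described in the abstract can be illustrated with a minimal hand-written sketch (not the authors' polyhedral tool): a loop nest is blocked so that each tile's working set stays cache-resident, while the innermost loop remains unit-stride and therefore vectorizable. The tile sizes TI and TJ below are hypothetical placeholders for the hardware-derived values the paper computes from cache size and SIMD register width.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical tile sizes; the paper derives them from cache size and SIMD width.
constexpr int TI = 64;
constexpr int TJ = 256;

// Blocked 2D update: each (ii, jj) tile reuses data while it is still resident
// in cache, and the inner j-loop stays contiguous so the compiler can vectorize it.
void tiled_update(std::vector<float>& a, const std::vector<float>& b, int n) {
    for (int ii = 0; ii < n; ii += TI)
        for (int jj = 0; jj < n; jj += TJ)
            for (int i = ii; i < std::min(ii + TI, n); ++i)
                for (int j = jj; j < std::min(jj + TJ, n); ++j)
                    a[i * n + j] += 2.0f * b[i * n + j];
}
```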
Title: "Dependence-Based Code Transformation for Coarse-Grained Parallelism"
Authors: Bo Zhao, Zhen Li, A. Jannesari, F. Wolf, Weiguo Wu
DOI: https://doi.org/10.1145/2723772.2723777
Abstract: Multicore architectures are becoming more common today. Many software products implemented sequentially have failed to exploit the potential parallelism of multicore architectures. Significant re-engineering and refactoring of existing software is needed to support the use of new hardware features. Due to the high cost of manual transformation, an automated approach to transforming existing software to take advantage of multicore architectures would be highly beneficial. We propose a novel auto-parallelization approach that integrates data-dependence profiling, task-parallelism extraction, and source-to-source transformation. Coarse-grained task parallelism is detected based on a concept called the Computational Unit (CU). We use dynamic profiling information to gather control and data dependences among tasks and generate a task graph. In addition, we develop a source-to-source transformation tool based on LLVM that can perform high-level code restructuring. It transforms the generated task graph, with the loop parallelism and task parallelism of the sequential code, into parallel code using Intel Threading Building Blocks (TBB). We have evaluated NAS Parallel Benchmark applications, three applications from the PARSEC benchmark suite, and real-world applications. The results confirm that our approach achieves promising performance with minor user intervention. The average speedups of loop parallelization and task parallelization are 3.12x and 9.92x, respectively.
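A minimal sketch of the kind of output such a source-to-source tool might emit, assuming two CUs have been found to be mutually independent and are mapped onto TBB tasks. This is a generic tbb::task_group example, not the tool's actual generated code; the CU bodies are hypothetical.

```cpp
#include <tbb/task_group.h>
#include <vector>

// Two hypothetical, mutually independent computational units (CUs).
void cu_filter(std::vector<double>& v) { for (double& x : v) x *= 0.5; }
void cu_scale(std::vector<double>& w)  { for (double& x : w) x += 1.0; }

void run_parallel(std::vector<double>& v, std::vector<double>& w) {
    tbb::task_group tg;
    // Dependence profiling has established that the two CUs touch disjoint data,
    // so they can be spawned as concurrent TBB tasks.
    tg.run([&] { cu_filter(v); });
    tg.run([&] { cu_scale(w); });
    tg.wait();  // join before any dependent code runs
}
```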
Title: "Cycle-based Model to Evaluate Consistency Protocols within a Multi-protocol Compilation Tool-chain"
Authors: Hamza Chaker, Loïc Cudennec, Safae Dahmani, G. Gogniat, Martha Johanna Sepúlveda
DOI: https://doi.org/10.1145/2723772.2723779
Abstract: Many-core processors consist of hundreds to thousands of cores, distributed memories, and a dedicated network on a single chip. In this context, and because of the scale of the processor, providing a shared memory system has to rely on efficient hardware mechanisms and/or data-consistency protocols. Previous work has explored several consistency mechanisms designed for many-core processors and concluded that no single protocol fits all applications and hardware contexts. It is therefore relevant to use a multi-protocol platform in which the shared data of an application can be managed by different protocols. Protocols are chosen and configured at compile time, following a static analysis of the application and the profiling of memory accesses. In this work, we propose a high-level timed model that we use to evaluate, at compile time, the consistency protocol that has been assigned to a given application and a given Network-on-Chip (NoC). The model calculates the number of NoC cycles needed for each data access, which can be turned into mean access cycles for each core or each shared datum. The model is not as accurate as a cycle-based NoC simulator or an instruction-set simulator. However, it is accurate enough to evaluate the impact of choosing and configuring a protocol, and its lightweight implementation allows it to run within an operational-research optimization loop. To validate our approach, we apply the model to compare three consistency protocols on a 2D mesh network, compiling a parallel convolution application.
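As a rough illustration of what such a high-level timed model could look like (the abstract does not give the actual model), one might estimate the cycle cost of a remote access on a 2D mesh from the Manhattan hop distance, a per-hop latency, and a fixed protocol overhead. All constants and names below are assumptions for the sketch, not values from the paper.

```cpp
#include <cstdlib>

// Hypothetical per-hop and protocol costs; a real model is calibrated per NoC.
constexpr int kCyclesPerHop     = 2;
constexpr int kProtocolOverhead = 10;  // e.g. directory lookup / acknowledgement

// Cycles for core (cx, cy) to access a datum homed at node (hx, hy) on a 2D mesh,
// counting request and reply traversals plus the protocol overhead.
int access_cycles(int cx, int cy, int hx, int hy) {
    int hops = std::abs(cx - hx) + std::abs(cy - hy);    // Manhattan distance
    return 2 * hops * kCyclesPerHop + kProtocolOverhead; // round trip + overhead
}
```

Summing such estimates over the profiled accesses of each core would give the mean access cycles the abstract mentions.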
Title: "The Basic Building Blocks of Parallel Tasks"
Authors: Rohit Atre, A. Jannesari, F. Wolf
DOI: https://doi.org/10.1145/2723772.2723778
Abstract: Discovery of parallelization opportunities in sequential programs can greatly reduce the time and effort required to parallelize any application. Identification and analysis of code that contains little to no internal parallelism can also help expose potential parallelism. This paper provides a technique to identify a block of code called a Computational Unit (CU) that performs a unit of work in a program. A CU can assist in discovering the potential parallelism in a sequential program by acting as a basic building block for tasks. CUs are used along with dynamic analysis information to identify the tasks that contain tightly coupled code within them. This process in turn reveals the tasks that are weakly dependent or independent. The independent tasks can be run in parallel, and the dependent tasks can be analyzed to check if the dependences can be resolved. To evaluate our technique, different benchmark applications are parallelized using our identified tasks and the speedups are reported. In addition, existing parallel implementations of the applications are compared with the identified tasks for the respective applications.
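A toy sketch of the underlying independence test, assuming each CU is summarized by the read and write sets of the memory locations it touches (the paper obtains this information from dynamic analysis; the set representation here is an assumption made for illustration).

```cpp
#include <cstdint>
#include <set>

struct CU {
    std::set<std::uintptr_t> reads;   // addresses read by this computational unit
    std::set<std::uintptr_t> writes;  // addresses written by this computational unit
};

static bool intersects(const std::set<std::uintptr_t>& a,
                       const std::set<std::uintptr_t>& b) {
    for (auto x : a) if (b.count(x)) return true;
    return false;
}

// Two CUs can form independent tasks if no flow, anti, or output dependence
// exists between them, i.e. their read/write sets do not conflict.
bool independent(const CU& t1, const CU& t2) {
    return !intersects(t1.writes, t2.reads) &&   // no flow dependence
           !intersects(t1.reads,  t2.writes) &&  // no anti dependence
           !intersects(t1.writes, t2.writes);    // no output dependence
}
```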
Title: "An Evaluation of Memory Sharing Performance for Heterogeneous Embedded SoCs with Many-Core Accelerators"
Authors: Pirmin Vogel, A. Marongiu, L. Benini
DOI: https://doi.org/10.1145/2723772.2723775
Abstract: Today's systems-on-chip (SoCs) increasingly conform to the models envisioned by the Heterogeneous System Architecture (HSA) Foundation, in which massively parallel, programmable many-core accelerators (PMCAs) not only cooperate but also coherently share memory with a powerful multi-core host processor. Allowing direct access to system memory from both sides greatly simplifies application development, but it increases the potential interference on the memory system caused by the PMCA. In this work, we evaluate the impact of a PMCA's memory traffic on host performance using the Xilinx Zynq-7000 SoC. This platform features a dual-core ARM Cortex-A9 CPU as well as a field-programmable gate array (FPGA), which we use to model a PMCA. Synthetic workloads, real benchmarks from the MiBench and ALPBench suites, and collaborative workloads all show that the interference generated by the PMCA can significantly reduce the memory bandwidth seen by the host (on average up to 25% for host applications).
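For flavor, a minimal host-side sketch of the kind of synthetic workload used to observe such interference: a STREAM-like copy loop whose achieved bandwidth drops when the accelerator simultaneously streams from the shared DRAM. This is a generic microbenchmark written for illustration, not the one used in the paper; the buffer size is an arbitrary choice meant to defeat the caches.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Measure host copy bandwidth over buffers large enough to defeat the caches.
int main() {
    const std::size_t n = 16u * 1024 * 1024;  // 16 Mi floats = 64 MiB per buffer
    std::vector<float> src(n, 1.0f), dst(n, 0.0f);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i];
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    double gib  = 2.0 * n * sizeof(float) / (1024.0 * 1024.0 * 1024.0);  // read + write
    std::printf("host copy bandwidth: %.2f GiB/s\n", gib / secs);
}
```

Running the same loop while the FPGA-modeled PMCA issues its own traffic would expose the bandwidth reduction the paper quantifies.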
Title: "Runtime Support for Multiple Offload-Based Programming Models on Embedded Manycore Accelerators"
Authors: Alessandro Capotondi, Germain Haugou, A. Marongiu, L. Benini
DOI: https://doi.org/10.1145/2723772.2723773
Abstract: Many modern high-end embedded systems are designed as heterogeneous systems-on-chip (SoCs), where a powerful general-purpose multicore host processor is coupled to a manycore accelerator. The host executes legacy applications on top of standard operating systems, while the accelerator runs highly parallel code kernels within those applications. Several programming models are currently being proposed to program such accelerator-based systems, OpenCL and OpenMP being the most relevant examples. In the near future it will be common to have multiple applications, coded with different programming models, concurrently requiring the use of the manycore accelerator. In this paper we present a runtime system for a cluster-based manycore accelerator, optimized for the concurrent execution of OpenMP and OpenCL kernels. The runtime supports spatial partitioning of the manycore, where clusters can be grouped into several "virtual" accelerator instances. Our runtime design is modular and relies on a "generic" component for resource (cluster) scheduling, plus "specialized" components which efficiently deploy generic offload requests into an implementation of the target programming model's semantics. We evaluate the proposed runtime system on a real heterogeneous system, the STMicroelectronics STHORM development board.
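For illustration, one of the offload programming models such a runtime serves could be exercised with a generic OpenMP target region like the one below. This is standard OpenMP 4.x accelerator syntax, not STHORM-specific or runtime-specific code.

```cpp
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float a[1 << 20], b[1 << 20];
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Offload the loop to the accelerator; a runtime like the one described in
    // the paper would map such a request onto one "virtual" accelerator instance
    // (a group of clusters), possibly alongside concurrently running OpenCL kernels.
    #pragma omp target parallel for map(tofrom: a[0:n]) map(to: b[0:n])
    for (int i = 0; i < n; ++i)
        a[i] += b[i];

    std::printf("a[0] = %f\n", a[0]);
}
```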
Title: "A Roadmap for a Type Architecture Based Parallel Programming Language"
Authors: Muhammad N. Yanhaona, A. Grimshaw
DOI: https://doi.org/10.1145/2723772.2723774
Abstract: Ever since the end of the era of single-processor performance improvement, we have observed a proliferation of multi- and many-core architectures in almost all spheres of computing. Tablets, desktop PCs, workstation clusters, and supercomputers are rife with multi-core CPUs and/or accelerators. Although these machine architectures exhibit considerable architectural heterogeneity and differences in scale, there is also significant commonality in their multi-core building blocks. This situation revives interest in the possibility of parallel programming paradigms that are both efficient and portable across environments. We have been investigating such a programming paradigm for over a year and a half. Two key aspects of our work are the development of a common machine abstraction for diverse hardware platforms and a clear separation of the different concerns embodied in parallel programming. This is our first paper describing the philosophy, abstraction mechanism, programming model, and early results of our ongoing project.
Title: "Exploiting Dynamic Parallelism to Efficiently Support Irregular Nested Loops on GPUs"
Authors: Da Li, Hancheng Wu, M. Becchi
DOI: https://doi.org/10.1145/2723772.2723780
Abstract: Graphics Processing Units (GPUs) have been used in general-purpose computing for several years. The newly introduced Dynamic Parallelism feature of Nvidia's Kepler GPUs allows launching kernels from the GPU directly. However, naive use of this feature can cause a high number of nested kernel launches, each performing limited work, leading to GPU underutilization and poor performance. We propose workload consolidation mechanisms at different granularities to maximize the work performed by nested kernels and reduce their overhead. Our end goal is to design automatic code transformation techniques for applications with irregular nested loops.
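The consolidation idea can be sketched, in host-side C++ rather than CUDA, as flattening the irregular inner iterations into one worklist that is processed in a few large, uniform chunks instead of spawning one tiny child launch per outer iteration. This is only a CPU-side analogue of the GPU technique, written under the assumption that the inner ranges have already been concatenated; names and constants are illustrative, not taken from the paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// 'work' holds all inner iterations of the irregular nested loop, concatenated
// into one flat worklist. Instead of one small task per outer iteration, the
// worklist is processed in large chunks so each "launch" has enough work.
void consolidated_process(std::vector<float>& work,
                          std::size_t chunk = std::size_t(1) << 16) {
    const std::size_t total = work.size();
    for (std::size_t base = 0; base < total; base += chunk) {
        const std::size_t end = std::min(base + chunk, total);
        for (std::size_t k = base; k < end; ++k)
            work[k] *= 2.0f;  // stand-in for the real per-element kernel body
    }
}
```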
Title: "Proceedings of the 2015 International Workshop on Code Optimisation for Multi and Many Cores"
DOI: https://doi.org/10.1145/2723772