Title: "Hardware-Aware Automatic Code-Transformation to Support Compilers in Exploiting the Multi-Level Parallel Potential of Modern CPUs"
Authors: Dustin Feld, T. Soddemann, M. Jünger, Sven Mallach
DOI: https://doi.org/10.1145/2723772.2723776
Abstract: Modern compilers offer more and more capabilities to automatically parallelize code regions if these match certain properties. However, several application kernels that would match these properties after rather simple transformations are either not parallelized at all by state-of-the-art compilers or could at least be improved with respect to their performance. This paper proposes a loop-tiling approach focusing on automatic vectorization and multi-core parallelization, with emphasis on smart cache exploitation. The method is based on polyhedral code transformations applied as a pre-compilation step, and it is shown to help compilers generate more and better parallel code regions. It automatically adapts to hardware parameters such as the SIMD register width and cache sizes. Further, it takes memory-access patterns into account and is capable of minimizing communication among tiles that are to be processed by different cores. An extensive computational study shows significant improvements in the number of vectorized instructions, cache miss rates, and running times for a range of application kernels. The method often outperforms the internal auto-parallelization techniques implemented in gcc and icc.
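The tiling idea described in the abstract can be illustrated with a minimal hand-written sketch (not the authors' polyhedral tool): a loop nest is blocked so that each tile's working set stays cache-resident, while the innermost loop remains unit-stride and therefore vectorizable. The tile sizes TI and TJ below are hypothetical placeholders for the hardware-derived values the paper computes from cache size and SIMD register width.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical tile sizes; the paper derives them from cache size and SIMD width.
constexpr int TI = 64;
constexpr int TJ = 256;

// Blocked 2D update: each (ii, jj) tile reuses data while it is still resident
// in cache, and the inner j-loop stays contiguous so the compiler can vectorize it.
void tiled_update(std::vector<float>& a, const std::vector<float>& b, int n) {
    for (int ii = 0; ii < n; ii += TI)
        for (int jj = 0; jj < n; jj += TJ)
            for (int i = ii; i < std::min(ii + TI, n); ++i)
                for (int j = jj; j < std::min(jj + TJ, n); ++j)
                    a[i * n + j] += 2.0f * b[i * n + j];
}
```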
Title: "Dependence-Based Code Transformation for Coarse-Grained Parallelism"
Authors: Bo Zhao, Zhen Li, A. Jannesari, F. Wolf, Weiguo Wu
DOI: https://doi.org/10.1145/2723772.2723777
Abstract: Multicore architectures are becoming more common today. Many software products implemented sequentially have failed to exploit the potential parallelism of multicore architectures. Significant re-engineering and refactoring of existing software is needed to support the use of new hardware features. Due to the high cost of manual transformation, an automated approach to transforming existing software to take advantage of multicore architectures would be highly beneficial. We propose a novel auto-parallelization approach that integrates data-dependence profiling, task-parallelism extraction, and source-to-source transformation. Coarse-grained task parallelism is detected based on a concept called the Computational Unit (CU). We use dynamic profiling information to gather control and data dependences among tasks and generate a task graph. In addition, we develop a source-to-source transformation tool based on LLVM that can perform high-level code restructuring. It transforms the generated task graph, with the loop parallelism and task parallelism of the sequential code, into parallel code using Intel Threading Building Blocks (TBB). We have evaluated NAS Parallel Benchmark applications, three applications from the PARSEC benchmark suite, and real-world applications. The results confirm that our approach achieves promising performance with minor user intervention. The average speedups of loop parallelization and task parallelization are 3.12x and 9.92x, respectively.
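A minimal sketch of the kind of output such a source-to-source tool might emit, assuming two CUs have been found to be mutually independent and are mapped onto TBB tasks. This is a generic tbb::task_group example, not the tool's actual generated code; the CU bodies are hypothetical.

```cpp
#include <tbb/task_group.h>
#include <vector>

// Two hypothetical, mutually independent computational units (CUs).
void cu_filter(std::vector<double>& v) { for (double& x : v) x *= 0.5; }
void cu_scale(std::vector<double>& w)  { for (double& x : w) x += 1.0; }

void run_parallel(std::vector<double>& v, std::vector<double>& w) {
    tbb::task_group tg;
    // Dependence profiling has established that the two CUs touch disjoint data,
    // so they can be spawned as concurrent TBB tasks.
    tg.run([&] { cu_filter(v); });
    tg.run([&] { cu_scale(w); });
    tg.wait();  // join before any dependent code runs
}
```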
Title: "Cycle-based Model to Evaluate Consistency Protocols within a Multi-protocol Compilation Tool-chain"
Authors: Hamza Chaker, Loïc Cudennec, Safae Dahmani, G. Gogniat, Martha Johanna Sepúlveda
DOI: https://doi.org/10.1145/2723772.2723779
Abstract: Many-core processors consist of hundreds to thousands of cores, distributed memories, and a dedicated network on a single chip. In this context, and because of the scale of the processor, providing a shared memory system has to rely on efficient hardware mechanisms and/or data-consistency protocols. Previous work has explored several consistency mechanisms designed for many-core processors and concluded that no single protocol fits all applications and hardware contexts. It is therefore relevant to use a multi-protocol platform in which the shared data of an application can be managed by different protocols. Protocols are chosen and configured at compile time, following a static analysis of the application and the profiling of memory accesses. In this work, we propose a high-level timed model that we use to evaluate, at compile time, the consistency protocol that has been assigned to a given application and a given Network-on-Chip (NoC). The model calculates the number of NoC cycles needed for each data access, which can be turned into mean access cycles for each core or each shared datum. The model is not as accurate as a cycle-based NoC simulator or an instruction-set simulator. However, it is accurate enough to evaluate the impact of choosing and configuring a protocol, and its lightweight implementation allows it to run within an operational-research optimization loop. To validate our approach, we apply the model to compare three consistency protocols on a 2D mesh network, compiling a parallel convolution application.
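As a rough illustration of what such a high-level timed model could look like (the abstract does not give the actual model), one might estimate the cycle cost of a remote access on a 2D mesh from the Manhattan hop distance, a per-hop latency, and a fixed protocol overhead. All constants and names below are assumptions for the sketch, not values from the paper.

```cpp
#include <cstdlib>

// Hypothetical per-hop and protocol costs; a real model is calibrated per NoC.
constexpr int kCyclesPerHop     = 2;
constexpr int kProtocolOverhead = 10;  // e.g. directory lookup / acknowledgement

// Cycles for core (cx, cy) to access a datum homed at node (hx, hy) on a 2D mesh,
// counting request and reply traversals plus the protocol overhead.
int access_cycles(int cx, int cy, int hx, int hy) {
    int hops = std::abs(cx - hx) + std::abs(cy - hy);    // Manhattan distance
    return 2 * hops * kCyclesPerHop + kProtocolOverhead; // round trip + overhead
}
```

Summing such estimates over the profiled accesses of each core would give the mean access cycles the abstract mentions.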
Title: "The Basic Building Blocks of Parallel Tasks"
Authors: Rohit Atre, A. Jannesari, F. Wolf
DOI: https://doi.org/10.1145/2723772.2723778
Abstract: Discovery of parallelization opportunities in sequential programs can greatly reduce the time and effort required to parallelize any application. Identification and analysis of code that contains little to no internal parallelism can also help expose potential parallelism. This paper provides a technique to identify a block of code called a Computational Unit (CU) that performs a unit of work in a program. A CU can assist in discovering the potential parallelism in a sequential program by acting as a basic building block for tasks. CUs are used along with dynamic analysis information to identify the tasks that contain tightly coupled code within them. This process in turn reveals the tasks that are weakly dependent or independent. The independent tasks can be run in parallel, and the dependent tasks can be analyzed to check if the dependences can be resolved. To evaluate our technique, different benchmark applications are parallelized using our identified tasks and the speedups are reported. In addition, existing parallel implementations of the applications are compared with the identified tasks for the respective applications.
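A toy sketch of the underlying independence test, assuming each CU is summarized by the read and write sets of the memory locations it touches (the paper obtains this information from dynamic analysis; the set representation here is an assumption made for illustration).

```cpp
#include <cstdint>
#include <set>

struct CU {
    std::set<std::uintptr_t> reads;   // addresses read by this computational unit
    std::set<std::uintptr_t> writes;  // addresses written by this computational unit
};

static bool intersects(const std::set<std::uintptr_t>& a,
                       const std::set<std::uintptr_t>& b) {
    for (auto x : a) if (b.count(x)) return true;
    return false;
}

// Two CUs can form independent tasks if no flow, anti, or output dependence
// exists between them, i.e. their read/write sets do not conflict.
bool independent(const CU& t1, const CU& t2) {
    return !intersects(t1.writes, t2.reads) &&   // no flow dependence
           !intersects(t1.reads,  t2.writes) &&  // no anti dependence
           !intersects(t1.writes, t2.writes);    // no output dependence
}
```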
Title: "An Evaluation of Memory Sharing Performance for Heterogeneous Embedded SoCs with Many-Core Accelerators"
Authors: Pirmin Vogel, A. Marongiu, L. Benini
DOI: https://doi.org/10.1145/2723772.2723775
Abstract: Today's systems-on-chip (SoCs) increasingly conform to the models envisioned by the Heterogeneous System Architecture (HSA) Foundation, in which massively parallel, programmable many-core accelerators (PMCAs) not only cooperate but also coherently share memory with a powerful multi-core host processor. Allowing direct access to system memory from both sides greatly simplifies application development, but it increases the potential interference on the memory system caused by the PMCA. In this work, we evaluate the impact of a PMCA's memory traffic on host performance using the Xilinx Zynq-7000 SoC. This platform features a dual-core ARM Cortex-A9 CPU as well as a field-programmable gate array (FPGA), which we use to model a PMCA. Synthetic workloads, real benchmarks from the MiBench and ALPBench suites, and collaborative workloads all show that the interference generated by the PMCA can significantly reduce the memory bandwidth seen by the host (on average up to 25% for host applications).
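For flavor, a minimal host-side sketch of the kind of synthetic workload used to observe such interference: a STREAM-like copy loop whose achieved bandwidth drops when the accelerator simultaneously streams from the shared DRAM. This is a generic microbenchmark written for illustration, not the one used in the paper; the buffer size is an arbitrary choice meant to defeat the caches.

```cpp
#include <chrono>
#include <cstddef>
#include <cstdio>
#include <vector>

// Measure host copy bandwidth over buffers large enough to defeat the caches.
int main() {
    const std::size_t n = 16u * 1024 * 1024;  // 16 Mi floats = 64 MiB per buffer
    std::vector<float> src(n, 1.0f), dst(n, 0.0f);

    auto t0 = std::chrono::steady_clock::now();
    for (std::size_t i = 0; i < n; ++i) dst[i] = src[i];
    auto t1 = std::chrono::steady_clock::now();

    double secs = std::chrono::duration<double>(t1 - t0).count();
    double gib  = 2.0 * n * sizeof(float) / (1024.0 * 1024.0 * 1024.0);  // read + write
    std::printf("host copy bandwidth: %.2f GiB/s\n", gib / secs);
}
```

Running the same loop while the FPGA-modeled PMCA issues its own traffic would expose the bandwidth reduction the paper quantifies.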
Title: "Runtime Support for Multiple Offload-Based Programming Models on Embedded Manycore Accelerators"
Authors: Alessandro Capotondi, Germain Haugou, A. Marongiu, L. Benini
DOI: https://doi.org/10.1145/2723772.2723773
Abstract: Many modern high-end embedded systems are designed as heterogeneous systems-on-chip (SoCs), where a powerful general-purpose multicore host processor is coupled to a manycore accelerator. The host executes legacy applications on top of standard operating systems, while the accelerator runs highly parallel code kernels within those applications. Several programming models are currently being proposed to program such accelerator-based systems, OpenCL and OpenMP being the most relevant examples. In the near future it will be common to have multiple applications, coded with different programming models, concurrently requiring the use of the manycore accelerator. In this paper we present a runtime system for a cluster-based manycore accelerator, optimized for the concurrent execution of OpenMP and OpenCL kernels. The runtime supports spatial partitioning of the manycore, where clusters can be grouped into several "virtual" accelerator instances. Our runtime design is modular and relies on a "generic" component for resource (cluster) scheduling, plus "specialized" components which efficiently deploy generic offload requests into an implementation of the target programming model's semantics. We evaluate the proposed runtime system on a real heterogeneous system, the STMicroelectronics STHORM development board.
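For illustration, one of the offload programming models such a runtime serves could be exercised with a generic OpenMP target region like the one below. This is standard OpenMP 4.x accelerator syntax, not STHORM-specific or runtime-specific code.

```cpp
#include <cstdio>

int main() {
    const int n = 1 << 20;
    static float a[1 << 20], b[1 << 20];
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    // Offload the loop to the accelerator; a runtime like the one described in
    // the paper would map such a request onto one "virtual" accelerator instance
    // (a group of clusters), possibly alongside concurrently running OpenCL kernels.
    #pragma omp target parallel for map(tofrom: a[0:n]) map(to: b[0:n])
    for (int i = 0; i < n; ++i)
        a[i] += b[i];

    std::printf("a[0] = %f\n", a[0]);
}
```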
Title: "A Roadmap for a Type Architecture Based Parallel Programming Language"
Authors: Muhammad N. Yanhaona, A. Grimshaw
DOI: https://doi.org/10.1145/2723772.2723774
Abstract: Ever since the end of the era of single-processor performance improvement, we have observed a proliferation of multi- and many-core architectures in almost all spheres of computing. Tablets, desktop PCs, workstation clusters, and supercomputers are rife with multi-core CPUs and/or accelerators. Although these machine architectures exhibit considerable architectural heterogeneity and differences in scale, there is also significant commonality in their multi-core building blocks. This situation revives interest in the possibility of parallel programming paradigms that are both efficient and portable across environments. We have been investigating such a programming paradigm for over a year and a half. Two key aspects of our work are the development of a common machine abstraction for diverse hardware platforms and a clear separation of the different concerns embodied in parallel programming. This is our first paper describing the philosophy, abstraction mechanism, programming model, and early results of our ongoing project.
Title: "Exploiting Dynamic Parallelism to Efficiently Support Irregular Nested Loops on GPUs"
Authors: Da Li, Hancheng Wu, M. Becchi
DOI: https://doi.org/10.1145/2723772.2723780
Abstract: Graphics Processing Units (GPUs) have been used in general-purpose computing for several years. The newly introduced Dynamic Parallelism feature of Nvidia's Kepler GPUs allows launching kernels from the GPU directly. However, naive use of this feature can cause a high number of nested kernel launches, each performing limited work, leading to GPU underutilization and poor performance. We propose workload consolidation mechanisms at different granularities to maximize the work performed by nested kernels and reduce their overhead. Our end goal is to design automatic code transformation techniques for applications with irregular nested loops.
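The consolidation idea can be sketched, in host-side C++ rather than CUDA, as flattening the irregular inner iterations into one worklist that is processed in a few large, uniform chunks instead of spawning one tiny child launch per outer iteration. This is only a CPU-side analogue of the GPU technique, written under the assumption that the inner ranges have already been concatenated; names and constants are illustrative, not taken from the paper.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// 'work' holds all inner iterations of the irregular nested loop, concatenated
// into one flat worklist. Instead of one small task per outer iteration, the
// worklist is processed in large chunks so each "launch" has enough work.
void consolidated_process(std::vector<float>& work,
                          std::size_t chunk = std::size_t(1) << 16) {
    const std::size_t total = work.size();
    for (std::size_t base = 0; base < total; base += chunk) {
        const std::size_t end = std::min(base + chunk, total);
        for (std::size_t k = base; k < end; ++k)
            work[k] *= 2.0f;  // stand-in for the real per-element kernel body
    }
}
```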
Title: "Proceedings of the 2015 International Workshop on Code Optimisation for Multi and Many Cores"
DOI: https://doi.org/10.1145/2723772