Proceedings of the 2015 International Workshop on Code Optimisation for Multi and Many Cores: Latest Articles

Hardware-Aware Automatic Code-Transformation to Support Compilers in Exploiting the Multi-Level Parallel Potential of Modern CPUs
Dustin Feld, T. Soddemann, M. Jünger, Sven Mallach
DOI: 10.1145/2723772.2723776 · Published: 2015-02-08
Modern compilers offer more and more capabilities to automatically parallelize code regions that match certain properties. However, several application kernels are either not parallelized at all by state-of-the-art compilers, or could at least be improved with respect to performance, even though rather simple transformations would suffice to make them match those properties. This paper proposes a loop-tiling approach focused on automatic vectorization and multi-core parallelization, with emphasis on smart cache exploitation. The method is based on polyhedral code transformations applied as a pre-compilation step, and it is shown to help compilers generate more and better parallel code regions. It automatically adapts to hardware parameters such as the SIMD register width and the cache sizes. Further, it takes memory-access patterns into account and can minimize communication among tiles that are to be processed by different cores. An extensive computational study shows significant improvements in the number of vectorized instructions, cache miss rates, and running times for a range of application kernels. The method often outperforms the internal auto-parallelization techniques implemented in gcc and icc.
Citations: 10
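The paper's polyhedral tool is not reproduced here, but the classic transformation it builds on can be sketched. Below is a minimal, runnable Python illustration of loop tiling on a matrix multiply; the tile size is arbitrary, whereas the paper derives it from cache and SIMD parameters, and the real method emits restructured C for the compiler to vectorize:

```python
N, TILE = 8, 4  # tile size chosen arbitrarily; the paper derives it from hardware parameters

def matmul_naive(a, b):
    """Reference triple loop."""
    return [[sum(a[i][k] * b[k][j] for k in range(N)) for j in range(N)] for i in range(N)]

def matmul_tiled(a, b):
    """Same computation restructured into TILE x TILE blocks, so each block of b
    is reused while it would be hot in cache and the unit-stride innermost
    j-loop becomes a vectorization candidate."""
    c = [[0] * N for _ in range(N)]
    for ii in range(0, N, TILE):
        for kk in range(0, N, TILE):
            for jj in range(0, N, TILE):
                for i in range(ii, ii + TILE):
                    for k in range(kk, kk + TILE):
                        aik = a[i][k]
                        for j in range(jj, jj + TILE):
                            c[i][j] += aik * b[k][j]
    return c

a = [[i + j for j in range(N)] for i in range(N)]
b = [[i - j for j in range(N)] for i in range(N)]
assert matmul_tiled(a, b) == matmul_naive(a, b)
print("tiled result matches naive result")
```

The transformation only reorders iterations; with integer inputs the tiled and naive results are identical, which is what makes it safe as a pre-compilation step.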
Dependence-Based Code Transformation for Coarse-Grained Parallelism
Bo Zhao, Zhen Li, A. Jannesari, F. Wolf, Weiguo Wu
DOI: 10.1145/2723772.2723777 · Published: 2015-02-08
Multicore architectures are becoming more common today, yet many software products implemented sequentially fail to exploit their potential parallelism. Significant re-engineering and refactoring of existing software is needed to take advantage of new hardware features, and because manual transformation is costly, an automated approach to transforming existing software would be highly beneficial. We propose a novel auto-parallelization approach that integrates data-dependence profiling, task-parallelism extraction, and source-to-source transformation. Coarse-grained task parallelism is detected based on a concept called the Computational Unit (CU). We use dynamic profiling information to gather control and data dependences among tasks and generate a task graph. In addition, we develop a source-to-source transformation tool based on LLVM that performs high-level code restructuring: it turns the generated task graph, together with the loop and task parallelism of the sequential code, into parallel code using Intel Threading Building Blocks (TBB). We evaluated NAS Parallel Benchmark applications, three applications from the PARSEC benchmark suite, and real-world applications. The results confirm that our approach achieves promising performance with minimal user intervention. The average speedups of loop parallelization and task parallelization are 3.12x and 9.92x, respectively.
Citations: 9
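The TBB-based pipeline itself is not shown in the abstract; the following Python sketch (task names and the thread-pool wave scheduler are illustrative assumptions standing in for the paper's TBB code generation) shows the core idea of running a dependence-derived task graph: tasks whose dependences are all satisfied execute concurrently, wave by wave:

```python
from concurrent.futures import ThreadPoolExecutor

# Toy task graph: task -> set of tasks it depends on, as would be derived
# from data-dependence profiling (names here are hypothetical).
deps = {"load": set(), "filter_a": {"load"}, "filter_b": {"load"}, "merge": {"filter_a", "filter_b"}}

def run(task, results):
    # Placeholder for the task's actual work.
    results[task] = f"done:{task}"

def schedule(deps):
    """Execute the graph wave by wave: every task whose dependences are
    already done is independent of the other ready tasks and runs in parallel."""
    results, done = {}, set()
    with ThreadPoolExecutor() as pool:
        while len(done) < len(deps):
            ready = [t for t in deps if t not in done and deps[t] <= done]
            list(pool.map(lambda t: run(t, results), ready))  # concurrent wave
            done |= set(ready)
    return results

print(schedule(deps))
```

Here `filter_a` and `filter_b` form one concurrent wave between `load` and `merge`; a cyclic graph would indicate dependences that must be resolved before parallelization.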
Cycle-based Model to Evaluate Consistency Protocols within a Multi-protocol Compilation Tool-chain
Hamza Chaker, Loïc Cudennec, Safae Dahmani, G. Gogniat, Martha Johanna Sepúlveda
DOI: 10.1145/2723772.2723779 · Published: 2015-02-08
Many-core processors combine hundreds to thousands of cores, distributed memories, and a dedicated network on a single chip. In this context, and because of the scale of the processor, providing a shared-memory system has to rely on efficient hardware mechanisms and/or data-consistency protocols. Previous work exploring consistency mechanisms designed for many-core processors concluded that no single protocol fits all applications and hardware contexts. It is therefore reasonable to use a multi-protocol platform, in which the shared data of an application can be managed by different protocols. Protocols are chosen and configured at compile time, following a static analysis of the application and a profiling of its memory accesses. In this work, we propose a high-level timed model that we use to evaluate, at compile time, the consistency protocol assigned to a given application and a given Network-on-Chip (NoC). The model calculates the number of NoC cycles needed for each data access, which can be turned into mean access cycles per core or per shared datum. It is not as accurate as a cycle-based NoC simulator or an instruction-set simulator, but it is accurate enough to evaluate the impact of choosing and configuring a protocol, and its lightweight implementation allows it to run within an operational-research optimization loop. To validate our approach, we apply the model to compare three consistency protocols on a 2D mesh network while compiling a parallel convolution application.
Citations: 3
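A minimal sketch of what such a high-level timed model computes, under assumed costs (the per-hop latency and protocol overhead below are hypothetical; the abstract does not specify the paper's cost model): per-access NoC cycles from Manhattan-distance hops on a 2D mesh, aggregated into mean access cycles per core:

```python
def access_cycles(src, dst, hop_cycles=2, protocol_overhead=4):
    """NoC cycles for one access: Manhattan hops on a 2D mesh, traversed once
    for the request and once for the reply, plus a fixed per-access protocol
    cost (all parameter values hypothetical)."""
    hops = abs(src[0] - dst[0]) + abs(src[1] - dst[1])
    return protocol_overhead + 2 * hops * hop_cycles

def mean_cycles_per_core(accesses):
    """accesses: list of (core_xy, home_node_xy) pairs from the memory-access
    profile; returns mean access cycles for each core."""
    total, count = {}, {}
    for core, home in accesses:
        total[core] = total.get(core, 0) + access_cycles(core, home)
        count[core] = count.get(core, 0) + 1
    return {core: total[core] / count[core] for core in total}

profile = [((0, 0), (1, 1)), ((0, 0), (0, 3)), ((2, 2), (2, 2))]
print(mean_cycles_per_core(profile))  # → {(0, 0): 14.0, (2, 2): 4.0}
```

Being a closed-form cost per access, such a model is cheap enough to evaluate inside an optimization loop, unlike a cycle-accurate simulator.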
The Basic Building Blocks of Parallel Tasks
Rohit Atre, A. Jannesari, F. Wolf
DOI: 10.1145/2723772.2723778 · Published: 2015-02-08
Discovering parallelization opportunities in sequential programs can greatly reduce the time and effort required to parallelize an application. Identifying and analyzing code that contains little to no internal parallelism can also help expose potential parallelism. This paper provides a technique to identify a block of code, called a Computational Unit (CU), that performs a unit of work in a program. A CU can assist in discovering the potential parallelism in a sequential program by acting as a basic building block for tasks. CUs are used along with dynamic analysis information to identify tasks whose internal code is tightly coupled. This process in turn reveals tasks that are weakly dependent or independent: the independent tasks can run in parallel, and the dependent tasks can be analyzed to check whether their dependences can be resolved. To evaluate the technique, different benchmark applications are parallelized using the identified tasks and the speedups are reported. In addition, the identified tasks are compared with existing parallel implementations of the respective applications.
Citations: 5
An Evaluation of Memory Sharing Performance for Heterogeneous Embedded SoCs with Many-Core Accelerators
Pirmin Vogel, A. Marongiu, L. Benini
DOI: 10.1145/2723772.2723775 · Published: 2015-02-08
Today's systems-on-chip (SoCs) increasingly conform to the models envisioned by the Heterogeneous System Architecture (HSA) Foundation, in which massively parallel, programmable many-core accelerators (PMCAs) not only cooperate with a powerful multi-core host processor but also coherently share memory with it. Allowing direct access to system memory from both sides greatly simplifies application development, but it increases the potential interference on the memory system caused by the PMCA. In this work, we evaluate the impact of a PMCA's memory traffic on host performance using the Xilinx Zynq-7000 SoC. This platform features a dual-core ARM Cortex-A9 CPU as well as a field-programmable gate array (FPGA), which we use to model a PMCA. Synthetic workloads, real benchmarks from the MiBench and ALPBench suites, and collaborative workloads all show that the interference generated by the PMCA can significantly reduce the memory bandwidth seen by the host (on average, up to 25% for host applications).
Citations: 7
Runtime Support for Multiple Offload-Based Programming Models on Embedded Manycore Accelerators
Alessandro Capotondi, Germain Haugou, A. Marongiu, L. Benini
DOI: 10.1145/2723772.2723773 · Published: 2015-02-08
Many modern high-end embedded systems are designed as heterogeneous systems-on-chip (SoCs), where a powerful general-purpose multicore host processor is coupled to a manycore accelerator. The host executes legacy applications on top of standard operating systems, while the accelerator runs highly parallel code kernels within those applications. Several programming models are currently being proposed for such accelerator-based systems, OpenCL and OpenMP being the most relevant examples. In the near future it will be common for multiple applications, coded with different programming models, to concurrently require the manycore accelerator. In this paper we present a runtime system for a cluster-based manycore accelerator, optimized for the concurrent execution of OpenMP and OpenCL kernels. The runtime supports spatial partitioning of the manycore, where clusters can be grouped into several "virtual" accelerator instances. Our runtime design is modular and relies on a "generic" component for resource (cluster) scheduling, plus "specialized" components that efficiently map generic offload requests onto an implementation of the target programming model's semantics. We evaluate the proposed runtime system on a real heterogeneous system, the STMicroelectronics STHORM development board.
Citations: 1
A Roadmap for a Type Architecture Based Parallel Programming Language
Muhammad N. Yanhaona, A. Grimshaw
DOI: 10.1145/2723772.2723774 · Published: 2015-02-08
Ever since the end of the era of single-processor performance improvement, multi- and many-core architectures have proliferated across almost all spheres of computing. Tablets, desktop PCs, workstation clusters, and supercomputers are ripe with multi-core CPUs and/or accelerators. Although these machine architectures differ considerably in heterogeneity and scale, there is significant commonality in their multi-core building blocks. This situation revives interest in parallel programming paradigms that are both efficient and portable across environments. We have been investigating such a programming paradigm for over a year and a half. Two key aspects of our work are the development of a common machine abstraction for diverse hardware platforms and a clear separation of the different concerns embodied in parallel programming. This is our first paper describing the philosophy, abstraction mechanism, programming model, and early results of our ongoing project.
Citations: 1
Exploiting Dynamic Parallelism to Efficiently Support Irregular Nested Loops on GPUs
Da Li, Hancheng Wu, M. Becchi
DOI: 10.1145/2723772.2723780 · Published: 2015-02-08
Graphics Processing Units (GPUs) have been used in general-purpose computing for several years. The Dynamic Parallelism feature introduced with Nvidia's Kepler GPUs allows kernels to be launched directly from the GPU. However, naïve use of this feature can cause a high number of nested kernel launches, each performing limited work, leading to GPU underutilization and poor performance. We propose workload-consolidation mechanisms at different granularities to maximize the work performed by nested kernels and reduce their overhead. Our end goal is to design automatic code-transformation techniques for applications with irregular nested loops.
Citations: 6
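The consolidation idea can be illustrated on a CPU analogue (the CSR-like data and the batch size below are hypothetical; the real mechanism operates on GPU child-kernel launches): instead of one tiny launch per outer iteration, the irregular inner iterations are flattened into a single worklist and processed in fixed-size, well-filled batches, preserving the result while cutting the number of launches:

```python
# Irregular nested loop over a CSR-like adjacency structure (hypothetical data):
# node i owns the inner iterations edges[row_ptr[i] : row_ptr[i + 1]].
row_ptr = [0, 3, 4, 9, 9, 11]
edges   = [5, 2, 7, 1, 0, 4, 4, 6, 3, 8, 9]

def per_row_launch(row_ptr, edges):
    """Naive scheme: one 'child kernel' per outer iteration, most of which
    perform very little work (row 3 here launches with zero work)."""
    out = []
    for i in range(len(row_ptr) - 1):
        out.extend(edges[row_ptr[i]:row_ptr[i + 1]])  # tiny launch per row
    return out

def consolidated_launch(row_ptr, edges, batch=4):
    """Consolidated scheme: flatten all inner iterations into one worklist and
    process it in fixed-size batches, each batch standing in for one
    well-filled kernel launch. Returns (results, number of launches)."""
    batches = [edges[i:i + batch] for i in range(0, len(edges), batch)]
    return [e for b in batches for e in b], len(batches)

flat, launches = consolidated_launch(row_ptr, edges)
assert flat == per_row_launch(row_ptr, edges)
print(f"{launches} consolidated launches instead of {len(row_ptr) - 1}")
```

The same work is done either way; the win on a GPU comes from fewer launches, each with enough iterations to keep the hardware busy.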
Proceedings of the 2015 International Workshop on Code Optimisation for Multi and Many Cores
DOI: 10.1145/2723772
Citations: 0