Parallel Locality and Parallelization Quality

Bernard Goossens, David Parello, Katarzyna Porada, Djallal Rahmoune
{"title":"并行局部性和并行化质量","authors":"Bernard Goossens, David Parello, Katarzyna Porada, Djallal Rahmoune","doi":"10.1145/2883404.2883410","DOIUrl":null,"url":null,"abstract":"This paper presents a new distributed computation model adapted to manycore processors. In this model, the run is spread on the available cores by fork machine instructions produced by the compiler, for example at function calls and loops iterations. This approach is to be opposed to the actual model of computation based on cache and predictor. Cache efficiency relies on data locality and predictor efficiency relies on the reproducibility of the control. Data locality and control reproducibility are less effective when the execution is distributed. The computation model proposed is based on a new core hardware. Its main features are described in this paper. This new core is the building block of a manycore design. The processor automatically parallelizes an execution. It keeps the computation deterministic by constructing a totally ordered trace of the machine instructions run. References are renamed, including memory, which fixes the communications and synchronizations needs. When a data is referenced, its producer is found in the trace and the reader is synchronized with the writer. This paper shows how a consumer can be located in the same core as its producer, improving parallel locality and parallelization quality. Our deterministic and fine grain distribution of a run on a manycore processor is compared with OS primitives and API based parallelization (e.g. pthread, OpenMP or MPI) and to compiler automatic parallelization of loops. The former implies (i) a high OS overhead meaning that only coarse grain parallelization is cost-effective and (ii) a non deterministic behaviour meaning that appropriate synchronization to eliminate wrong results is a challenge. The latter is unable to fully parallelize general purpose programs due to structures like functions, complex loops and branches.","PeriodicalId":185841,"journal":{"name":"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores","volume":"205 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":"{\"title\":\"Parallel Locality and Parallelization Quality\",\"authors\":\"Bernard Goossens, David Parello, Katarzyna Porada, Djallal Rahmoune\",\"doi\":\"10.1145/2883404.2883410\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper presents a new distributed computation model adapted to manycore processors. In this model, the run is spread on the available cores by fork machine instructions produced by the compiler, for example at function calls and loops iterations. This approach is to be opposed to the actual model of computation based on cache and predictor. Cache efficiency relies on data locality and predictor efficiency relies on the reproducibility of the control. Data locality and control reproducibility are less effective when the execution is distributed. The computation model proposed is based on a new core hardware. Its main features are described in this paper. This new core is the building block of a manycore design. The processor automatically parallelizes an execution. It keeps the computation deterministic by constructing a totally ordered trace of the machine instructions run. 
References are renamed, including memory, which fixes the communications and synchronizations needs. When a data is referenced, its producer is found in the trace and the reader is synchronized with the writer. This paper shows how a consumer can be located in the same core as its producer, improving parallel locality and parallelization quality. Our deterministic and fine grain distribution of a run on a manycore processor is compared with OS primitives and API based parallelization (e.g. pthread, OpenMP or MPI) and to compiler automatic parallelization of loops. The former implies (i) a high OS overhead meaning that only coarse grain parallelization is cost-effective and (ii) a non deterministic behaviour meaning that appropriate synchronization to eliminate wrong results is a challenge. The latter is unable to fully parallelize general purpose programs due to structures like functions, complex loops and branches.\",\"PeriodicalId\":185841,\"journal\":{\"name\":\"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores\",\"volume\":\"205 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-03-12\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"2\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2883404.2883410\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 7th International Workshop on Programming Models and Applications for Multicores and Manycores","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2883404.2883410","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 2

Abstract

This paper presents a new distributed computation model adapted to manycore processors. In this model, the run is spread across the available cores by fork machine instructions produced by the compiler, for example at function calls and loop iterations. This approach contrasts with the current model of computation based on caches and predictors. Cache efficiency relies on data locality, and predictor efficiency relies on the reproducibility of control flow; both are less effective when the execution is distributed. The proposed computation model is based on a new core design, whose main features are described in this paper. This new core is the building block of a manycore processor. The processor automatically parallelizes an execution and keeps the computation deterministic by constructing a totally ordered trace of the machine instructions run. References, including memory references, are renamed, which fixes the communication and synchronization needs. When a datum is referenced, its producer is found in the trace and the reader is synchronized with the writer. This paper shows how a consumer can be located on the same core as its producer, improving parallel locality and parallelization quality. Our deterministic and fine-grained distribution of a run on a manycore processor is compared with parallelization based on OS primitives and APIs (e.g. pthread, OpenMP or MPI) and with automatic compiler parallelization of loops. The former implies (i) a high OS overhead, meaning that only coarse-grained parallelization is cost-effective, and (ii) a non-deterministic behaviour, meaning that appropriate synchronization to eliminate wrong results is a challenge. The latter is unable to fully parallelize general-purpose programs due to structures like functions, complex loops and branches.
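The producer-lookup mechanism described above can be sketched in software. The following C model is purely illustrative: it assumes an append-only trace of renamed memory writes and a backward scan to find a read's producer. The paper describes a hardware implementation whose actual structures are not given in the abstract, so every type and function name here (trace_write, trace_read, struct entry) is hypothetical.

```c
/* Hypothetical software model of producer lookup in a totally ordered
 * trace. Each memory write appends a new version of its address to the
 * trace; a read scans backwards for the latest earlier write to the same
 * address, which identifies its producer and the core that ran it. */
#include <stdio.h>

#define TRACE_MAX 1024

struct entry {
    long addr;   /* renamed memory reference (its address)      */
    int  value;  /* value produced by the writing instruction   */
    int  core;   /* core that executed the producing instruction */
};

static struct entry trace[TRACE_MAX];
static int trace_len = 0;

/* A write appends a new version of addr to the trace. */
static void trace_write(long addr, int value, int core)
{
    trace[trace_len].addr  = addr;
    trace[trace_len].value = value;
    trace[trace_len].core  = core;
    trace_len++;
}

/* A read finds its producer by scanning the trace backwards. Knowing the
 * producer's core is what lets a consumer be placed on the same core,
 * which is the parallel-locality point the abstract makes. */
static int trace_read(long addr, int *producer_core)
{
    for (int i = trace_len - 1; i >= 0; i--) {
        if (trace[i].addr == addr) {
            *producer_core = trace[i].core;
            return trace[i].value;
        }
    }
    *producer_core = -1;  /* no producer: value comes from initial memory */
    return 0;
}

int main(void)
{
    int core;
    trace_write(0x1000, 42, /*core=*/2);   /* producer runs on core 2     */
    int v = trace_read(0x1000, &core);     /* consumer locates it in trace */
    printf("value %d produced on core %d\n", v, core);
    return 0;
}
```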
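For the comparison point at the end of the abstract, here is a minimal pthread version of a coarse-grained loop parallelization; the chunking scheme and names are ours, not from the paper. Each pthread_create/pthread_join pair crosses the OS, which is the overhead the paper's compiler-inserted fork instructions are meant to avoid, and which is why the work must be split into a few large chunks rather than one task per iteration.

```c
/* Coarse-grained pthread parallelization of a vector addition. Thread
 * creation goes through the OS, so each thread receives a large
 * contiguous chunk; spawning a thread per iteration would not be
 * cost-effective. Compile with: cc -pthread example.c */
#include <pthread.h>
#include <stdio.h>

#define N        1000000
#define NTHREADS 4

static double a[N], b[N], c[N];

struct range { int lo, hi; };

static void *add_chunk(void *arg)
{
    struct range *r = arg;
    for (int i = r->lo; i < r->hi; i++)
        c[i] = a[i] + b[i];
    return NULL;
}

int main(void)
{
    pthread_t tid[NTHREADS];
    struct range rg[NTHREADS];

    for (int i = 0; i < N; i++) {          /* initialize the inputs */
        a[i] = i;
        b[i] = 2.0 * i;
    }
    for (int t = 0; t < NTHREADS; t++) {   /* one OS thread per chunk */
        rg[t].lo = t * (N / NTHREADS);
        rg[t].hi = (t + 1) * (N / NTHREADS);
        pthread_create(&tid[t], NULL, add_chunk, &rg[t]);
    }
    for (int t = 0; t < NTHREADS; t++)     /* explicit synchronization */
        pthread_join(tid[t], NULL);

    printf("c[N-1] = %f\n", c[N - 1]);
    return 0;
}
```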