在Intel SCC上实现类似openmp并行的高效屏障同步

2013 International Conference on Parallel and Distributed Systems Pub Date : 1900-01-01 DOI:10.1109/ICPADS.2013.15

H. Al-Khalissi, R. Buchty, Mladen Berekovic

{"title":"在Intel SCC上实现类似openmp并行的高效屏障同步","authors":"H. Al-Khalissi, R. Buchty, Mladen Berekovic","doi":"10.1109/ICPADS.2013.15","DOIUrl":null,"url":null,"abstract":"The continuous increase of the number of processing cores on die poses a new set of challenges to HPC applications programming including how to model, write, and verify software that has to use the full power of NoC-based manycore processors. Therefore, to simplify program development for the Single-chip Cloud Computer (SCC), it is desirable to have high-level, shared memory-based parallel programming abstractions (e.g., an OpenMP-like programming model). One of the key components of any similar programming model are barrier synchronization primitives, coordinating the work of parallel threads. To allow high-level barrier constructs to deliver good performance, we need an efficient implementation of the underlying synchronization algorithm. In this paper, we propose effective barrier synchronization implementations for shared-memory programming on non-cache-coherent cluster-on-chip represented by the Intel SCC. In particular, we present an extensive evaluation of the overhead associated with integrating barrier algorithms required for OpenMP runtime libraries on such a machine, validating several implementation variants that efficiently exploit the network topology and leveraging SCC-specific hardware. We provide a detailed evaluation of the performance achieved by different approaches by using micro-benchmarks.","PeriodicalId":160979,"journal":{"name":"2013 International Conference on Parallel and Distributed Systems","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Efficient Barrier Synchronization for OpenMP-Like Parallelism on the Intel SCC\",\"authors\":\"H. Al-Khalissi, R. Buchty, Mladen Berekovic\",\"doi\":\"10.1109/ICPADS.2013.15\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The continuous increase of the number of processing cores on die poses a new set of challenges to HPC applications programming including how to model, write, and verify software that has to use the full power of NoC-based manycore processors. Therefore, to simplify program development for the Single-chip Cloud Computer (SCC), it is desirable to have high-level, shared memory-based parallel programming abstractions (e.g., an OpenMP-like programming model). One of the key components of any similar programming model are barrier synchronization primitives, coordinating the work of parallel threads. To allow high-level barrier constructs to deliver good performance, we need an efficient implementation of the underlying synchronization algorithm. In this paper, we propose effective barrier synchronization implementations for shared-memory programming on non-cache-coherent cluster-on-chip represented by the Intel SCC. In particular, we present an extensive evaluation of the overhead associated with integrating barrier algorithms required for OpenMP runtime libraries on such a machine, validating several implementation variants that efficiently exploit the network topology and leveraging SCC-specific hardware. We provide a detailed evaluation of the performance achieved by different approaches by using micro-benchmarks.\",\"PeriodicalId\":160979,\"journal\":{\"name\":\"2013 International Conference on Parallel and Distributed Systems\",\"volume\":\"16 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"1900-01-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 International Conference on Parallel and Distributed Systems\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICPADS.2013.15\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 International Conference on Parallel and Distributed Systems","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPADS.2013.15","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 4

摘要

芯片上处理内核数量的不断增加给高性能计算应用程序编程带来了一系列新的挑战，包括如何建模、编写和验证必须使用基于cpu的多核处理器的全部功能的软件。因此，为了简化单芯片云计算机(SCC)的程序开发，需要具有高级的、基于共享内存的并行编程抽象(例如，类似openmp的编程模型)。任何类似编程模型的关键组件之一是屏障同步原语，用于协调并行线程的工作。为了允许高级屏障构造提供良好的性能，我们需要底层同步算法的有效实现。在本文中，我们提出了以英特尔SCC为代表的非缓存一致性片上集群的共享内存编程的有效屏障同步实现。特别是，我们对在这样一台机器上集成OpenMP运行时库所需的屏障算法相关的开销进行了广泛的评估，验证了几种有效利用网络拓扑和利用scc特定硬件的实现变体。我们通过使用微基准对不同方法实现的性能进行了详细的评估。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Efficient Barrier Synchronization for OpenMP-Like Parallelism on the Intel SCC

The continuous increase of the number of processing cores on die poses a new set of challenges to HPC applications programming including how to model, write, and verify software that has to use the full power of NoC-based manycore processors. Therefore, to simplify program development for the Single-chip Cloud Computer (SCC), it is desirable to have high-level, shared memory-based parallel programming abstractions (e.g., an OpenMP-like programming model). One of the key components of any similar programming model are barrier synchronization primitives, coordinating the work of parallel threads. To allow high-level barrier constructs to deliver good performance, we need an efficient implementation of the underlying synchronization algorithm. In this paper, we propose effective barrier synchronization implementations for shared-memory programming on non-cache-coherent cluster-on-chip represented by the Intel SCC. In particular, we present an extensive evaluation of the overhead associated with integrating barrier algorithms required for OpenMP runtime libraries on such a machine, validating several implementation variants that efficiently exploit the network topology and leveraging SCC-specific hardware. We provide a detailed evaluation of the performance achieved by different approaches by using micro-benchmarks.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2013 International Conference on Parallel and Distributed Systems

自引率

0.00%

发文量