GPGPU同步原语在cpu上的高效实现

Proceedings of the 7th ACM international conference on Computing frontiers Pub Date : 2010-05-17 DOI:10.1145/1787275.1787295

J. Gummaraju, B. Sander, L. Morichetti, Benedict R. Gaster, Lee W. Howes

{"title":"GPGPU同步原语在cpu上的高效实现","authors":"J. Gummaraju, B. Sander, L. Morichetti, Benedict R. Gaster, Lee W. Howes","doi":"10.1145/1787275.1787295","DOIUrl":null,"url":null,"abstract":"The GPGPU model represents a style of execution where thousands of threads execute in a data-parallel fashion, with a large subset (typically 10s to 100s) needing frequent synchronization. As the GPGPU model evolves target both GPUs and CPUs as acceleration targets, thread synchronization becomes an important problem when running on CPUs. CPUs have little hardware support for synchronization and must be emulated in software, reducing application performance. This paper presents software techniques to implement the GPGPU synchronization primitives on CPUs, while maintaining application debug-ability. Performing limit studies using real hardware, we evaluate the potential performance benefits of an efficient barrier primitive.","PeriodicalId":151791,"journal":{"name":"Proceedings of the 7th ACM international conference on Computing frontiers","volume":"12 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-05-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"Efficient implementation of GPGPU synchronization primitives on CPUs\",\"authors\":\"J. Gummaraju, B. Sander, L. Morichetti, Benedict R. Gaster, Lee W. Howes\",\"doi\":\"10.1145/1787275.1787295\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The GPGPU model represents a style of execution where thousands of threads execute in a data-parallel fashion, with a large subset (typically 10s to 100s) needing frequent synchronization. As the GPGPU model evolves target both GPUs and CPUs as acceleration targets, thread synchronization becomes an important problem when running on CPUs. CPUs have little hardware support for synchronization and must be emulated in software, reducing application performance. This paper presents software techniques to implement the GPGPU synchronization primitives on CPUs, while maintaining application debug-ability. Performing limit studies using real hardware, we evaluate the potential performance benefits of an efficient barrier primitive.\",\"PeriodicalId\":151791,\"journal\":{\"name\":\"Proceedings of the 7th ACM international conference on Computing frontiers\",\"volume\":\"12 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2010-05-17\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 7th ACM international conference on Computing frontiers\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1787275.1787295\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 7th ACM international conference on Computing frontiers","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1787275.1787295","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 3

摘要

GPGPU模型代表了一种执行风格，其中数千个线程以数据并行的方式执行，其中较大的子集(通常为10到100个)需要频繁同步。随着GPGPU模型以gpu和cpu为加速目标的发展，线程同步成为cpu上运行的一个重要问题。cpu对同步的硬件支持很少，必须在软件中进行模拟，从而降低了应用程序的性能。本文介绍了在保持应用程序可调试性的同时，在cpu上实现GPGPU同步原语的软件技术。通过使用真实硬件进行极限研究，我们评估了高效屏障原语的潜在性能优势。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Efficient implementation of GPGPU synchronization primitives on CPUs

The GPGPU model represents a style of execution where thousands of threads execute in a data-parallel fashion, with a large subset (typically 10s to 100s) needing frequent synchronization. As the GPGPU model evolves target both GPUs and CPUs as acceleration targets, thread synchronization becomes an important problem when running on CPUs. CPUs have little hardware support for synchronization and must be emulated in software, reducing application performance. This paper presents software techniques to implement the GPGPU synchronization primitives on CPUs, while maintaining application debug-ability. Performing limit studies using real hardware, we evaluate the potential performance benefits of an efficient barrier primitive.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 7th ACM international conference on Computing frontiers

自引率

0.00%

发文量