ELMO: A User-Friendly API to Enable Local Memory in OpenCL Kernels

2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing. Pub Date: 2013-02-27. DOI: 10.1109/pdp.2013.61

Jianbin Fang, A. Varbanescu, Jie Shen, H. Sips


Citations: 11

Abstract



Recent parallel architectures are equipped with local memory, which simplifies hardware design at the cost of increased program complexity, because the programmer must manage that memory explicitly. To ease this extra burden, we introduce ELMO, an easy-to-use API that improves productivity while preserving the high performance of local memory operations. Specifically, ELMO is a generic API that covers different local memory use cases. We also present prototype implementations of these APIs and apply multiple GPU-inspired optimizations to maximize their performance. Experimental results on the NVIDIA Quadro 5000 GPU show that using ELMO on native implementations improves performance significantly: the achieved speedup ranges from 1.3x to 3.7x. Furthermore, with ELMO we still achieve performance comparable to, if not better than, that of hand-tuned applications, while the code is shorter, clearer, and safer.
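The abstract does not show ELMO's interface, so the following is not ELMO code. It is only a minimal sketch of the kind of explicit local-memory management the abstract describes as a burden, emulated sequentially in Python. The 3-point stencil, the names `GROUP_SIZE`, `RADIUS`, `load_clamped`, `run_work_group`, `stencil_tiled`, and `stencil_direct`, and the choice of a clamped border are all assumptions made for illustration. The list `tile` stands in for an OpenCL `__local` buffer, and the boundary between the two loops marks where a real kernel would call `barrier(CLK_LOCAL_MEM_FENCE)`.

```python
GROUP_SIZE = 4  # work-items per work-group (hypothetical value)
RADIUS = 1      # 3-point stencil: out[i] reads data[i-1], data[i], data[i+1]


def load_clamped(data, i):
    """Read data[i], clamping the index at the array borders."""
    return data[min(max(i, 0), len(data) - 1)]


def run_work_group(data, out, group_id):
    """Emulate one work-group; `tile` stands in for a __local buffer."""
    base = group_id * GROUP_SIZE
    tile = [0.0] * (GROUP_SIZE + 2 * RADIUS)

    # Phase 1: each work-item stages its own element into the tile.
    # Work-item 0 additionally stages the two halo cells, a detail the
    # programmer has to get right by hand.
    for lid in range(GROUP_SIZE):
        tile[lid + RADIUS] = load_clamped(data, base + lid)
        if lid == 0:
            tile[0] = load_clamped(data, base - RADIUS)
            tile[GROUP_SIZE + RADIUS] = load_clamped(data, base + GROUP_SIZE)

    # A real OpenCL kernel needs barrier(CLK_LOCAL_MEM_FENCE) here so that
    # no work-item reads the tile before every work-item has written to it.

    # Phase 2: each work-item computes its output from the tile only.
    for lid in range(GROUP_SIZE):
        t = lid + RADIUS
        out[base + lid] = (tile[t - 1] + tile[t] + tile[t + 1]) / 3.0


def stencil_tiled(data):
    """Run the stencil group by group through the emulated local tile."""
    if len(data) % GROUP_SIZE != 0:
        raise ValueError("length must be a multiple of GROUP_SIZE")
    out = [0.0] * len(data)
    for group_id in range(len(data) // GROUP_SIZE):
        run_work_group(data, out, group_id)
    return out


def stencil_direct(data):
    """Reference version that reads the input directly, with no tile."""
    return [
        (load_clamped(data, i - 1) + load_clamped(data, i)
         + load_clamped(data, i + 1)) / 3.0
        for i in range(len(data))
    ]
```

Even in this small case the tiled version carries index arithmetic, halo handling, and a synchronization point that the direct version does not need. That bookkeeping is what the abstract refers to as explicit management.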

