异构系统的高效数据共享

2017 46th International Conference on Parallel Processing (ICPP) Pub Date : 2017-08-01 DOI:10.1109/ICPP.2017.21

Victor Garcia-Flores, E. Ayguadé, Antonio J. Peña

{"title":"异构系统的高效数据共享","authors":"Victor Garcia-Flores, E. Ayguadé, Antonio J. Peña","doi":"10.1109/ICPP.2017.21","DOIUrl":null,"url":null,"abstract":"General-purpose computing on GPUs has become more accessible due to features such as shared virtual memory and demand paging. Unfortunately it comes at a price, and that is performance. Automatic memory management is convenient but suffers from many drawbacks, preventing heterogeneous systems from achieving their full potential. In this work we analyze the challenges and inefficiencies of demand paging in GPUs, in particular on collaborative computations where data migrates multiple times between host and device. We establish that demand paging on GPUs introduces significant overheads for these kind of computations, and identify the issues of false sharing and unnecessary data transfers derived from the granularity at which data is migrated. In order to alleviate these problems we propose a memory organization and dynamic migration scheme to efficiently share data between host and device at fine granularities and without software intervention. We evaluate our design with a set of collaborative heterogeneous benchmarks and find it achieves 15% lower execution times on average with cache line-sized migrations, but severely degrading performance on benchmarks that access large blocks of contiguous memory. Page-sized migrations, although inefficient, provide on average a 47% execution time reduction with our design over a baseline system implementing demand paging. Our results suggest that cache line-sized migrations are not feasible in systems using a PCI-Express interconnect. In order to understand how future interconnect technologies will impact the feasibility of fine-grained migrations, we evaluate our scheme with various link latencies. We find interconnect latencies four to five times lower than PCI-Express are sufficient to effectively share data at finer granularities.","PeriodicalId":392710,"journal":{"name":"2017 46th International Conference on Parallel Processing (ICPP)","volume":"180 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"8","resultStr":"{\"title\":\"Efficient Data Sharing on Heterogeneous Systems\",\"authors\":\"Victor Garcia-Flores, E. Ayguadé, Antonio J. Peña\",\"doi\":\"10.1109/ICPP.2017.21\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"General-purpose computing on GPUs has become more accessible due to features such as shared virtual memory and demand paging. Unfortunately it comes at a price, and that is performance. Automatic memory management is convenient but suffers from many drawbacks, preventing heterogeneous systems from achieving their full potential. In this work we analyze the challenges and inefficiencies of demand paging in GPUs, in particular on collaborative computations where data migrates multiple times between host and device. We establish that demand paging on GPUs introduces significant overheads for these kind of computations, and identify the issues of false sharing and unnecessary data transfers derived from the granularity at which data is migrated. In order to alleviate these problems we propose a memory organization and dynamic migration scheme to efficiently share data between host and device at fine granularities and without software intervention. We evaluate our design with a set of collaborative heterogeneous benchmarks and find it achieves 15% lower execution times on average with cache line-sized migrations, but severely degrading performance on benchmarks that access large blocks of contiguous memory. Page-sized migrations, although inefficient, provide on average a 47% execution time reduction with our design over a baseline system implementing demand paging. Our results suggest that cache line-sized migrations are not feasible in systems using a PCI-Express interconnect. In order to understand how future interconnect technologies will impact the feasibility of fine-grained migrations, we evaluate our scheme with various link latencies. We find interconnect latencies four to five times lower than PCI-Express are sufficient to effectively share data at finer granularities.\",\"PeriodicalId\":392710,\"journal\":{\"name\":\"2017 46th International Conference on Parallel Processing (ICPP)\",\"volume\":\"180 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2017-08-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"8\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2017 46th International Conference on Parallel Processing (ICPP)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICPP.2017.21\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 46th International Conference on Parallel Processing (ICPP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICPP.2017.21","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 8

摘要

由于共享虚拟内存和需求分页等特性，gpu上的通用计算变得更容易访问。不幸的是，这是有代价的，那就是性能。自动内存管理很方便，但有许多缺点，妨碍异构系统充分发挥其潜力。在这项工作中，我们分析了gpu中需求分页的挑战和低效率，特别是在数据在主机和设备之间多次迁移的协作计算中。我们确定gpu上的需求分页为这类计算带来了显著的开销，并确定了从数据迁移的粒度派生的错误共享和不必要的数据传输的问题。为了缓解这些问题，我们提出了一种内存组织和动态迁移方案，在没有软件干预的情况下，在主机和设备之间以细粒度有效地共享数据。我们用一组协作异构基准测试来评估我们的设计，发现通过缓存行大小的迁移，平均执行时间降低了15%，但在访问大块连续内存的基准测试中，性能会严重下降。页面大小的迁移虽然效率低下，但在实现需求分页的基线系统上，我们的设计平均减少了47%的执行时间。我们的结果表明，在使用PCI-Express互连的系统中，缓存行大小的迁移是不可行的。为了理解未来的互连技术将如何影响细粒度迁移的可行性，我们用各种链路延迟来评估我们的方案。我们发现互连延迟比PCI-Express低4到5倍，足以有效地以更细的粒度共享数据。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

Efficient Data Sharing on Heterogeneous Systems

General-purpose computing on GPUs has become more accessible due to features such as shared virtual memory and demand paging. Unfortunately it comes at a price, and that is performance. Automatic memory management is convenient but suffers from many drawbacks, preventing heterogeneous systems from achieving their full potential. In this work we analyze the challenges and inefficiencies of demand paging in GPUs, in particular on collaborative computations where data migrates multiple times between host and device. We establish that demand paging on GPUs introduces significant overheads for these kind of computations, and identify the issues of false sharing and unnecessary data transfers derived from the granularity at which data is migrated. In order to alleviate these problems we propose a memory organization and dynamic migration scheme to efficiently share data between host and device at fine granularities and without software intervention. We evaluate our design with a set of collaborative heterogeneous benchmarks and find it achieves 15% lower execution times on average with cache line-sized migrations, but severely degrading performance on benchmarks that access large blocks of contiguous memory. Page-sized migrations, although inefficient, provide on average a 47% execution time reduction with our design over a baseline system implementing demand paging. Our results suggest that cache line-sized migrations are not feasible in systems using a PCI-Express interconnect. In order to understand how future interconnect technologies will impact the feasibility of fine-grained migrations, we evaluate our scheme with various link latencies. We find interconnect latencies four to five times lower than PCI-Express are sufficient to effectively share data at finer granularities.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2017 46th International Conference on Parallel Processing (ICPP)

自引率

0.00%

发文量