Optimized memory access support for data layout conversion on heterogeneous multi-core systems

2014 IEEE 12th Symposium on Embedded Systems for Real-time Multimedia (ESTIMedia). Pub Date: 2014-11-24. DOI: 10.1109/ESTIMedia.2014.6962353

C.C.-H. Hsu, Cheng-Yen Lin, Shin-Kai Chen, Chih-Wei Liu, Jenq-Kuen Lee


Citations: 7

Abstract

Heterogeneous multi-core systems that contain multiple CPUs and GPUs are gaining momentum because they provide different kinds of computation power to meet the performance demands of modern applications. On such systems, developers try to fully utilize the computation power of both the CPU and the GPU by using emerging programming models such as CUDA and OpenCL. To achieve maximal performance, developers must carefully offload the appropriate workload to each compute device according to the characteristics of the target architecture. In this scenario, seamless data movement between different processors becomes crucial. Additionally, re-organizing the data layout to fit the target architecture addresses this concern; examples include array-of-structures (AOS) for the CPU, structure-of-arrays (SOA) for the GPU, and conversion from coordinate (COO) format to ELLPACK (ELL) format for sparse computation. In this paper, we propose a hardware memory manager that efficiently optimizes the conversion of data layouts for heterogeneous multi-core systems on the fly. Our design addresses both memory coalescing and sparse format conversion. A novel ping-pong transpose architecture is devised to reorganize non-coalesced access patterns, and a histogram unit and a sparse address generator are presented to handle sparse storage format transformation. Our design reduces the overhead of data transfer and layout transformation between the CPU and the GPU. In our experiments, our design achieves a speedup of 2.19 to 68.5 times over a software-based library, depending on data size.
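To make the two layout conversions named in the abstract concrete, the following is a minimal software sketch in Python. It is not the paper's hardware design: the function names `aos_to_soa` and `coo_to_ell`, the padding values, and the row-major ELL layout are all assumptions chosen for illustration. The per-row counting pass only loosely mirrors the role the abstract assigns to the histogram unit, since the ELL width equals the largest number of nonzeros in any row.

```python
def aos_to_soa(records, fields):
    """Convert an array of structures (list of dicts) into a
    structure of arrays (dict of lists), one list per field."""
    return {f: [r[f] for r in records] for f in fields}


def coo_to_ell(rows, cols, vals, n_rows, pad_col=-1, pad_val=0.0):
    """Convert a COO sparse matrix to ELL format (illustrative sketch).

    rows, cols, vals are parallel lists holding one nonzero each.
    pad_col and pad_val are assumed padding markers for unused slots.
    """
    # Histogram pass: count the nonzeros in each row.
    counts = [0] * n_rows
    for r in rows:
        counts[r] += 1

    # The ELL width is the largest per-row count.
    width = max(counts, default=0)

    # Allocate padded n_rows x width arrays.
    ell_cols = [[pad_col] * width for _ in range(n_rows)]
    ell_vals = [[pad_val] * width for _ in range(n_rows)]

    # Scatter pass: place each nonzero in the next free slot of its row.
    fill = [0] * n_rows
    for r, c, v in zip(rows, cols, vals):
        k = fill[r]
        ell_cols[r][k] = c
        ell_vals[r][k] = v
        fill[r] += 1

    return ell_cols, ell_vals, width


if __name__ == "__main__":
    points = [{"x": 1, "y": 2}, {"x": 3, "y": 4}]
    print(aos_to_soa(points, ["x", "y"]))

    ell_cols, ell_vals, width = coo_to_ell(
        rows=[0, 0, 1, 2, 2, 2],
        cols=[0, 2, 1, 0, 1, 2],
        vals=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0],
        n_rows=3,
    )
    print(width, ell_cols, ell_vals)
```

In the SOA form, consecutive elements of one field sit next to each other, which is the property that lets neighboring GPU threads issue coalesced loads. In the ELL form every row has the same padded length, so row `i`, slot `k` can be addressed without a per-row pointer.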

