POSTER: ξ-TAO: A cache-centric execution model and runtime for deep parallel multicore topologies

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT). Pub Date: 2016-09-11. DOI: 10.1145/2967938.2974052

M. Pericàs


Citations: 5

Abstract

We have analyzed the ξ-TAO model and runtime with three benchmarks: a parallel hybrid quicksort/mergesort, a 2D Jacobi stencil, and the Unbalanced Tree Search (UTS) benchmark. We ran ξ-TAO implementations of these benchmarks on a Dell PowerEdge R815 server with four AMD Opteron 6348 processors, totalling 8 NUMA nodes and 48 cores. Figure 2 shows the scalability of UTS+ξ-TAO compared to thread-centric runtimes based on work stealing (MassiveThreads [6], Intel TBB) and hierarchical WS+PDF (Qthreads [10]). UTS was implemented in ξ-TAO by grouping sibling nodes into a TAO and attaching a static scheduler. UTS has a very small working set, so the best performance is achieved when each TAO is mapped to a single core (ξ-TAO-w1). The combination of tight reuse, pre-built task groups, and static scheduling results in high scalability for UTS+ξ-TAO. Unlike UTS, the parallel sorting and 2D Jacobi benchmarks are memory-intensive. By selecting assemblies of width two (i.e., the core-width of the L2 caches) and six (i.e., the core-width of the L3 cache), ξ-TAO is able to outperform competing approaches thanks to better management of available memory bandwidth and shared cache capacity.
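The relationship between assembly width and cache topology described in the abstract can be illustrated with a small sketch. This is not the ξ-TAO API: the function names `assembly_cores` and `static_partition` are hypothetical, and the code assumes that cores sharing a cache have contiguous IDs (pairs for L2, groups of six for L3), which may not match the actual core numbering on the test machine. It shows only how a width-aligned group of cores could be selected and how pre-grouped work, such as sibling nodes in UTS, could be split statically across that group.

```python
from typing import List

TOTAL_CORES = 48  # core count stated in the abstract


def assembly_cores(leader: int, width: int, total: int = TOTAL_CORES) -> List[int]:
    """Return the width-aligned group of cores that contains `leader`.

    Assumption: cores sharing a cache have contiguous IDs, so width 2
    approximates an L2 domain and width 6 an L3 domain.
    """
    if width <= 0 or total % width != 0:
        raise ValueError("width must be positive and divide the core count")
    if not 0 <= leader < total:
        raise ValueError("leader core out of range")
    start = (leader // width) * width  # round down to the group boundary
    return list(range(start, start + width))


def static_partition(n_items: int, width: int) -> List[range]:
    """Split n_items into `width` contiguous chunks whose sizes differ by at most one."""
    if width <= 0:
        raise ValueError("width must be positive")
    base, extra = divmod(n_items, width)
    chunks = []
    begin = 0
    for worker in range(width):
        # The first `extra` workers take one additional item.
        size = base + (1 if worker < extra else 0)
        chunks.append(range(begin, begin + size))
        begin += size
    return chunks


if __name__ == "__main__":
    for width in (1, 2, 6):
        print(width, assembly_cores(7, width))
    print(static_partition(10, 3))
```

With width 1 the group is a single core, which corresponds to the ξ-TAO-w1 configuration reported as best for UTS. Widths 2 and 6 correspond to the L2 and L3 sharing domains used for the memory-intensive benchmarks.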
