POSTER: ξ-TAO:一个以缓存为中心的执行模型和运行时，用于深度并行多核拓扑

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT) Pub Date : 2016-09-11 DOI:10.1145/2967938.2974052

M. Pericàs

{"title":"POSTER: ξ-TAO:一个以缓存为中心的执行模型和运行时，用于深度并行多核拓扑","authors":"M. Pericàs","doi":"10.1145/2967938.2974052","DOIUrl":null,"url":null,"abstract":"We have analyzed the ξ-TAO model and runtime with three benchmarks: a parallel hybrid quicksort/mergesort, a 2D Jacobi stencil, and the Unbalanced Tree Search (UTS) benchmark. We run ξ-TAO implementations of these benchmarks on a Dell PowerEdge R815 server with four AMD Opteron 6348 processors, totalling 8 NUMA nodes and 48 cores. Figure 2 shows the scalability of UTS+ξ-TAO compared to thread-centric runtimes based on work stealing (MassiveThreads [6], Intel TBB) and hierarchical WS+PDF (Qthreads [10]). UTS was implemented in ξ-TAO by grouping sibling nodes into a TAO and attaching a static scheduler. UTS has a very small working set, hence the best performance is achieved when each TAO is mapped to a single core (ξ-TAO-w1). The combination of tight reuse, pre-built task groups and static scheduling results in high scalability for UTS+ξ-TAO. Unlike UTS, the parallel sorting and 2D Jacobi benchmarks are memory intensive benchmarks. By selecting assemblies of width two (i.e., core-width of the L2 caches) and six (i.e., core-width of the L3 cache) ξ-TAO is able to outperform competing approaches thanks to better management of available memory bandwidth and shared cache capacity.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"154 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"POSTER: ξ-TAO: A cache-centric execution model and runtime for deep parallel multicore topologies\",\"authors\":\"M. Pericàs\",\"doi\":\"10.1145/2967938.2974052\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We have analyzed the ξ-TAO model and runtime with three benchmarks: a parallel hybrid quicksort/mergesort, a 2D Jacobi stencil, and the Unbalanced Tree Search (UTS) benchmark. We run ξ-TAO implementations of these benchmarks on a Dell PowerEdge R815 server with four AMD Opteron 6348 processors, totalling 8 NUMA nodes and 48 cores. Figure 2 shows the scalability of UTS+ξ-TAO compared to thread-centric runtimes based on work stealing (MassiveThreads [6], Intel TBB) and hierarchical WS+PDF (Qthreads [10]). UTS was implemented in ξ-TAO by grouping sibling nodes into a TAO and attaching a static scheduler. UTS has a very small working set, hence the best performance is achieved when each TAO is mapped to a single core (ξ-TAO-w1). The combination of tight reuse, pre-built task groups and static scheduling results in high scalability for UTS+ξ-TAO. Unlike UTS, the parallel sorting and 2D Jacobi benchmarks are memory intensive benchmarks. By selecting assemblies of width two (i.e., core-width of the L2 caches) and six (i.e., core-width of the L3 cache) ξ-TAO is able to outperform competing approaches thanks to better management of available memory bandwidth and shared cache capacity.\",\"PeriodicalId\":407717,\"journal\":{\"name\":\"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)\",\"volume\":\"154 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2967938.2974052\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2967938.2974052","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 5

摘要

我们用三个基准分析了ξ-TAO模型和运行时:并行混合快速排序/合并排序，2D Jacobi模板和不平衡树搜索(UTS)基准。我们在戴尔PowerEdge R815服务器上运行了这些基准测试的⊗- tao实现，该服务器配备了四个AMD Opteron 6348处理器，共有8个NUMA节点和48个核心。图2显示了UTS+ξ-TAO与基于工作窃取(MassiveThreads [6]， Intel TBB)和分层WS+PDF (Qthreads[10])的以线程为中心的运行时相比的可扩展性。UTS在ξ-TAO中通过将兄弟节点分组到一个TAO中并附加一个静态调度器来实现。UTS具有非常小的工作集，因此当每个TAO映射到单个核心(ξ-TAO-w1)时，可以实现最佳性能。紧密复用、预构建任务组和静态调度相结合，使UTS+ξ-TAO具有较高的可扩展性。与UTS不同，并行排序和2D Jacobi基准测试是内存密集型基准测试。通过选择宽度为2(即L2缓存的核心宽度)和6(即L3缓存的核心宽度)的组件，由于更好地管理可用内存带宽和共享缓存容量，ξ-TAO能够优于竞争方法。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文本刊更多论文

POSTER: ξ-TAO: A cache-centric execution model and runtime for deep parallel multicore topologies

We have analyzed the ξ-TAO model and runtime with three benchmarks: a parallel hybrid quicksort/mergesort, a 2D Jacobi stencil, and the Unbalanced Tree Search (UTS) benchmark. We run ξ-TAO implementations of these benchmarks on a Dell PowerEdge R815 server with four AMD Opteron 6348 processors, totalling 8 NUMA nodes and 48 cores. Figure 2 shows the scalability of UTS+ξ-TAO compared to thread-centric runtimes based on work stealing (MassiveThreads [6], Intel TBB) and hierarchical WS+PDF (Qthreads [10]). UTS was implemented in ξ-TAO by grouping sibling nodes into a TAO and attaching a static scheduler. UTS has a very small working set, hence the best performance is achieved when each TAO is mapped to a single core (ξ-TAO-w1). The combination of tight reuse, pre-built task groups and static scheduling results in high scalability for UTS+ξ-TAO. Unlike UTS, the parallel sorting and 2D Jacobi benchmarks are memory intensive benchmarks. By selecting assemblies of width two (i.e., core-width of the L2 caches) and six (i.e., core-width of the L3 cache) ξ-TAO is able to outperform competing approaches thanks to better management of available memory bandwidth and shared cache capacity.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)

自引率

0.00%

发文量