POSTER: ξ-TAO:一个以缓存为中心的执行模型和运行时,用于深度并行多核拓扑

M. Pericàs
{"title":"POSTER: ξ-TAO:一个以缓存为中心的执行模型和运行时,用于深度并行多核拓扑","authors":"M. Pericàs","doi":"10.1145/2967938.2974052","DOIUrl":null,"url":null,"abstract":"We have analyzed the ξ-TAO model and runtime with three benchmarks: a parallel hybrid quicksort/mergesort, a 2D Jacobi stencil, and the Unbalanced Tree Search (UTS) benchmark. We run ξ-TAO implementations of these benchmarks on a Dell PowerEdge R815 server with four AMD Opteron 6348 processors, totalling 8 NUMA nodes and 48 cores. Figure 2 shows the scalability of UTS+ξ-TAO compared to thread-centric runtimes based on work stealing (MassiveThreads [6], Intel TBB) and hierarchical WS+PDF (Qthreads [10]). UTS was implemented in ξ-TAO by grouping sibling nodes into a TAO and attaching a static scheduler. UTS has a very small working set, hence the best performance is achieved when each TAO is mapped to a single core (ξ-TAO-w1). The combination of tight reuse, pre-built task groups and static scheduling results in high scalability for UTS+ξ-TAO. Unlike UTS, the parallel sorting and 2D Jacobi benchmarks are memory intensive benchmarks. By selecting assemblies of width two (i.e., core-width of the L2 caches) and six (i.e., core-width of the L3 cache) ξ-TAO is able to outperform competing approaches thanks to better management of available memory bandwidth and shared cache capacity.","PeriodicalId":407717,"journal":{"name":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","volume":"154 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"POSTER: ξ-TAO: A cache-centric execution model and runtime for deep parallel multicore topologies\",\"authors\":\"M. Pericàs\",\"doi\":\"10.1145/2967938.2974052\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We have analyzed the ξ-TAO model and runtime with three benchmarks: a parallel hybrid quicksort/mergesort, a 2D Jacobi stencil, and the Unbalanced Tree Search (UTS) benchmark. We run ξ-TAO implementations of these benchmarks on a Dell PowerEdge R815 server with four AMD Opteron 6348 processors, totalling 8 NUMA nodes and 48 cores. Figure 2 shows the scalability of UTS+ξ-TAO compared to thread-centric runtimes based on work stealing (MassiveThreads [6], Intel TBB) and hierarchical WS+PDF (Qthreads [10]). UTS was implemented in ξ-TAO by grouping sibling nodes into a TAO and attaching a static scheduler. UTS has a very small working set, hence the best performance is achieved when each TAO is mapped to a single core (ξ-TAO-w1). The combination of tight reuse, pre-built task groups and static scheduling results in high scalability for UTS+ξ-TAO. Unlike UTS, the parallel sorting and 2D Jacobi benchmarks are memory intensive benchmarks. By selecting assemblies of width two (i.e., core-width of the L2 caches) and six (i.e., core-width of the L3 cache) ξ-TAO is able to outperform competing approaches thanks to better management of available memory bandwidth and shared cache capacity.\",\"PeriodicalId\":407717,\"journal\":{\"name\":\"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)\",\"volume\":\"154 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-09-11\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2967938.2974052\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 International Conference on Parallel Architecture and Compilation Techniques (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2967938.2974052","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 5

摘要

我们用三个基准分析了ξ-TAO模型和运行时:并行混合快速排序/合并排序,2D Jacobi模板和不平衡树搜索(UTS)基准。我们在戴尔PowerEdge R815服务器上运行了这些基准测试的⊗- tao实现,该服务器配备了四个AMD Opteron 6348处理器,共有8个NUMA节点和48个核心。图2显示了UTS+ξ-TAO与基于工作窃取(MassiveThreads [6], Intel TBB)和分层WS+PDF (Qthreads[10])的以线程为中心的运行时相比的可扩展性。UTS在ξ-TAO中通过将兄弟节点分组到一个TAO中并附加一个静态调度器来实现。UTS具有非常小的工作集,因此当每个TAO映射到单个核心(ξ-TAO-w1)时,可以实现最佳性能。紧密复用、预构建任务组和静态调度相结合,使UTS+ξ-TAO具有较高的可扩展性。与UTS不同,并行排序和2D Jacobi基准测试是内存密集型基准测试。通过选择宽度为2(即L2缓存的核心宽度)和6(即L3缓存的核心宽度)的组件,由于更好地管理可用内存带宽和共享缓存容量,ξ-TAO能够优于竞争方法。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
POSTER: ξ-TAO: A cache-centric execution model and runtime for deep parallel multicore topologies
We have analyzed the ξ-TAO model and runtime with three benchmarks: a parallel hybrid quicksort/mergesort, a 2D Jacobi stencil, and the Unbalanced Tree Search (UTS) benchmark. We run ξ-TAO implementations of these benchmarks on a Dell PowerEdge R815 server with four AMD Opteron 6348 processors, totalling 8 NUMA nodes and 48 cores. Figure 2 shows the scalability of UTS+ξ-TAO compared to thread-centric runtimes based on work stealing (MassiveThreads [6], Intel TBB) and hierarchical WS+PDF (Qthreads [10]). UTS was implemented in ξ-TAO by grouping sibling nodes into a TAO and attaching a static scheduler. UTS has a very small working set, hence the best performance is achieved when each TAO is mapped to a single core (ξ-TAO-w1). The combination of tight reuse, pre-built task groups and static scheduling results in high scalability for UTS+ξ-TAO. Unlike UTS, the parallel sorting and 2D Jacobi benchmarks are memory intensive benchmarks. By selecting assemblies of width two (i.e., core-width of the L2 caches) and six (i.e., core-width of the L3 cache) ξ-TAO is able to outperform competing approaches thanks to better management of available memory bandwidth and shared cache capacity.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
确定
请完成安全验证×
copy
已复制链接
快去分享给好友吧!
我知道了
右上角分享
点击右上角分享
0
联系我们:info@booksci.cn Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。 Copyright © 2023 布克学术 All rights reserved.
京ICP备2023020795号-1
ghs 京公网安备 11010802042870号
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术官方微信