Fast GPU Graph Contraction by Combining Efficient Shallow Searches and Post-Culling

2020 IEEE High Performance Extreme Computing Conference (HPEC). Pub Date: 2020-09-22. DOI: 10.1109/HPEC43674.2020.9286141

Roozbeh Karimi, David M. Koppelman, C. J. Michael


Abstract

Efficient GPU single-source shortest-path (SSSP) queries of road network graphs can be realized by a technique called PHAST (Delling et al.), in which the graph is contracted (pre-processed using Geisberger's Contraction Hierarchies) once and the resulting contracted graph is queried as needed. PHAST accommodates GPUs' parallelism requirements well, resulting in efficient queries. For situations in which a graph is not available well in advance or changes frequently, contraction time itself becomes significant. Karimi et al. recently described a GPU contraction technique, CU-CH, which significantly reduces the contraction time of small- to medium-sized graphs, reporting a speedup of over 20× on an NVIDIA P100 GPU. However, CU-CH realizes little speedup on larger graphs, such as the DIMACS USA and W. Europe graphs. The obstacle to faster contraction of larger graphs is the frequently performed witness path search (WPS). A WPS for a node determines which shortcut edges need to be added between the node's neighbors to maintain distances after the removal of the node. GPUs' strict thread convergence requirements and limited scratchpad preclude the bidirectional Dijkstra approach used in CPU implementations. Instead, CU-CH uses a two-hop-limit WPS tightly coded to fit GPU shared storage and to maintain thread convergence. Where two hops are sufficient, speedup is high, but for larger graphs the hop limit exacts a toll due to the accumulation of unneeded shortcuts. The problem is overcome here by retaining the efficient CU-CH WPS but using it both for its original purpose and to identify unnecessary shortcuts added in prior steps. The unnecessary shortcuts are culled (removed). Culling shortcuts not only dramatically reduces the time needed to contract a graph but also improves the quality of the contracted graph. For smaller graphs such as DIMACS Cal (travel time), contraction is 61% faster than with CU-CH. For the DIMACS Europe and USA graphs, contraction is 40× and 12× faster, respectively. SSSP query times also improve dramatically, approaching those obtained on aggressively contracted graphs. On NVIDIA V100 GPUs, the speedup over Geisberger's CPU code exceeds 100× on most graphs tried.
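
The interplay between a hop-limited witness path search and later culling can be illustrated with a small CPU-side sketch. This is not the authors' CU-CH code or their GPU kernel: it is a sequential Python toy, assuming an undirected graph stored as a dict of dicts with positive weights, and the names `add_edge`, `has_two_hop_witness`, `contract`, and `cull` are hypothetical helpers introduced only for illustration. It also drops the contracted node from the working graph rather than recording a hierarchy.

```python
from itertools import combinations

INF = float("inf")

def add_edge(graph, a, b, w):
    """Insert an undirected edge, keeping the smaller weight if one exists."""
    graph.setdefault(a, {})
    graph.setdefault(b, {})
    if w < graph[a].get(b, INF):
        graph[a][b] = w
        graph[b][a] = w

def has_two_hop_witness(graph, u, w, limit, skip_node=None, skip_direct=False):
    """True if some u-w path of at most two edges has length <= limit.

    skip_node excludes the node being contracted; skip_direct ignores the
    edge (u, w) itself, as needed when testing an existing shortcut.
    """
    if not skip_direct and graph[u].get(w, INF) <= limit:
        return True
    for x, d_ux in graph[u].items():
        if x == skip_node or x == w:
            continue
        if d_ux + graph[x].get(w, INF) <= limit:
            return True
    return False

def contract(graph, v):
    """Remove v, adding a shortcut for each neighbor pair with no witness."""
    added = []
    for u, w in combinations(list(graph[v]), 2):
        through_v = graph[v][u] + graph[v][w]
        if not has_two_hop_witness(graph, u, w, through_v, skip_node=v):
            added.append((u, w, through_v))
    for n in list(graph[v]):
        del graph[n][v]
    del graph[v]
    for u, w, c in added:
        add_edge(graph, u, w, c)
    return added

def cull(graph, shortcuts):
    """Drop earlier shortcuts that a two-hop search now shows are redundant."""
    removed = []
    for u, w, c in shortcuts:
        still_there = u in graph and graph[u].get(w) == c
        if still_there and has_two_hop_witness(graph, u, w, c, skip_direct=True):
            del graph[u][w]
            del graph[w][u]
            removed.append((u, w))
    return removed

# Toy graph: u-v-w costs 10, but the three-hop path u-a-b-w costs only 3.
g = {}
for a, b, wt in [("v", "u", 5), ("v", "w", 5),
                 ("u", "a", 1), ("a", "b", 1), ("b", "w", 1)]:
    add_edge(g, a, b, wt)

first = contract(g, "v")    # two hops cannot see u-a-b-w, so (u, w, 10) is added
second = contract(g, "a")   # adds (u, b, 2), shortening the witness to two hops
removed = cull(g, first)    # u-b-w (cost 3) now exposes (u, w, 10) as unneeded
print(first, second, removed)
```

In this toy, the shortcut between `u` and `w` is unnecessary from the start, but the two-hop search cannot prove it until contracting `a` collapses part of the longer witness path. Re-running the same shallow search on previously added shortcuts then removes it, which is the general idea the abstract describes; how and when the actual implementation schedules culling is not specified here.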


