An Efficient Transaction-Based GPU Implementation of Minimum Spanning Forest Algorithm

2017 International Conference on High Performance Computing & Simulation (HPCS). Pub Date: 2017-07-01. DOI: 10.1109/HPCS.2017.100

Shayan Manoochehri, B. Goodarzi, D. Goswami


Citations: 5

Abstract

General-purpose GPUs (GPGPUs) are ideal platforms for the parallel execution of applications with regular shared-memory access patterns. However, the majority of real-world multithreaded applications access shared memory with irregular patterns. Minimum Spanning Forest (MSF) computation arises in many real-world applications. Among MSF algorithms, Boruvka's algorithm exposes the most parallelism; however, it is an irregular algorithm that is challenging to implement on GPUs. In this paper we show that a transaction-based design and implementation of Boruvka's algorithm on a GPU can handle some of the challenges that arise from this irregularity. First, we identify the hotspots of the algorithm that are its main bottlenecks: edge discovery and merge. The edge discovery phase is implemented with lock-free synchronization after extracting certain algebraic properties (e.g., monotonicity) of the computation. The merge phase, however, lacks such algebraic properties, and hence we use a synchronization method based on Software Transactional Memory (STM). STM offers ease of use by guaranteeing deadlock-free and livelock-free behavior, in contrast to blocking lock-based synchronization. It also improves programmability by providing high-level abstractions for synchronization, which facilitate a natural transition from algorithm design to implementation. In addition, we employ several optimization techniques in different phases of the algorithm to achieve load balance and better GPU resource utilization. Experimental results show that our GPU-based implementation outperforms both the fastest sequential implementation and the existing STM-based implementation on multicore CPUs when tested on large-scale graphs with diverse densities.
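The two phases named in the abstract, edge discovery and merge, can be illustrated with a small sequential sketch of Boruvka's algorithm. This is not the authors' GPU or STM implementation: the function names (`find`, `boruvka_msf`), the `(weight, u, v)` edge layout, and the tie-breaking by edge index are all assumptions made for illustration. The point is only to show why edge discovery is a monotone "keep the minimum" update (the kind of property that permits lock-free updates in a parallel setting), while merge rewrites shared component structure and therefore needs stronger synchronization when run concurrently.

```python
from typing import List, Optional, Tuple

Edge = Tuple[int, int, int]  # (weight, u, v); hypothetical layout for this sketch


def find(parent: List[int], x: int) -> int:
    """Return the representative of x's component, using path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def boruvka_msf(n: int, edges: List[Edge]) -> Tuple[int, List[Edge]]:
    """Sequential Boruvka sketch: returns (total weight, forest edges)."""
    parent = list(range(n))
    forest: List[Edge] = []
    total = 0
    while True:
        # Edge discovery: for each component, keep the lightest outgoing edge.
        # cheapest[r] only ever moves toward a smaller key (a monotone update).
        cheapest: List[Optional[Tuple[int, int]]] = [None] * n
        for idx, (w, u, v) in enumerate(edges):
            ru, rv = find(parent, u), find(parent, v)
            if ru == rv:
                continue  # edge lies inside one component
            key = (w, idx)  # edge index breaks ties between equal weights
            for r in (ru, rv):
                if cheapest[r] is None or key < cheapest[r]:
                    cheapest[r] = key

        # Merge: contract each selected edge. This step rewrites shared
        # component structure, so a parallel version must synchronize it.
        merged = False
        for r in range(n):
            if cheapest[r] is None:
                continue
            w, idx = cheapest[r]
            _, u, v = edges[idx]
            ru, rv = find(parent, u), find(parent, v)
            if ru == rv:
                continue  # same edge was already applied from the other side
            parent[ru] = rv
            forest.append(edges[idx])
            total += w
            merged = True
        if not merged:
            break
    return total, forest


# Two components: vertices {0, 1, 2, 3} and {4, 5}.
example_edges = [(1, 0, 1), (2, 1, 2), (3, 0, 2), (4, 2, 3), (7, 4, 5)]
total, forest = boruvka_msf(6, example_edges)
print(total, len(forest))  # prints "14 4"
```

In this sketch the edge `(3, 0, 2)` is never selected, because vertices 0 and 2 are already connected through lighter edges by the time it could be considered.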



