{"title":"Provably Fast and Space-Efficient Parallel Biconnectivity (Abstract)","authors":"Xiaojun Dong, Letong Wang, Yan Gu, Yihan Sun","doi":"10.1145/3597635.3598018","DOIUrl":"https://doi.org/10.1145/3597635.3598018","url":null,"abstract":"We propose the first parallel biconnectivity algorithm (FAST-BCC) that has optimal work, polylogarithmic span, and is space-efficient. Our algorithm creates a skeleton graph based on any spanning tree of the input graph. Then we use the connectivity information of the skeleton to compute the biconnectivity of the original input. We carefully analyze the correctness of our algorithm. We implemented FAST-BCC and compared it with existing implementations, including GBBS, Slota and Madduri's algorithm, and the sequential Hopcroft-Tarjan algorithm. We tested them on a 96-core machine on 27 graphs with varying edge distributions. FAST-BCC is faster than all existing baselines on each graph.","PeriodicalId":185981,"journal":{"name":"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128526751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Empirical Challenge for NC Theory (Abstract)","authors":"Ananth Hari, U. Vishkin","doi":"10.1145/3597635.3598020","DOIUrl":"https://doi.org/10.1145/3597635.3598020","url":null,"abstract":"Horn-satisfiability or Horn-SAT is the problem of deciding whether a satisfying assignment exists for a Horn formula, a conjunction of clauses each with at most one positive literal (also known as Horn clauses). It is a well-known P-complete problem, which implies that unless P = NC, it is a hard problem to parallelize. In this paper, we empirically show that, under a known simple random model for generating the Horn formula, the ratio of hard-to-parallelize instances (closer to the worst-case behavior) is infinitesimally small. We show that the depth of a parallel algorithm for Horn-SAT is polylogarithmic on average, for almost all instances, while keeping the work linear. This challenges theoreticians and programmers to look beyond worst-case analysis and come up with practical algorithms coupled with respective performance guarantees.","PeriodicalId":185981,"journal":{"name":"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123963023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Smarter Atomic Smart Pointers: Safe and Efficient Concurrent Memory Management (Abstract)","authors":"Daniel Anderson, G. Blelloch, Yuanhao Wei","doi":"10.1145/3597635.3598027","DOIUrl":"https://doi.org/10.1145/3597635.3598027","url":null,"abstract":"We present a technique for concurrent memory management that combines the ease of use of automatic memory reclamation with the efficiency of state-of-the-art deferred reclamation algorithms. First, we combine ideas from reference counting and hazard pointers in a novel way to implement automatic concurrent reference counting with wait-free, constant-time overhead. Second, we generalize our previous algorithm to obtain a method for converting any standard manual SMR technique into an automatic reference counting technique with a similar performance profile. We have implemented the approach as a C++ library and compared it experimentally to existing atomic reference-counting libraries and state-of-the-art manual techniques. Our results indicate that our technique is faster than existing reference-counting implementations, and competitive with manual memory reclamation techniques. More importantly, it is significantly safer than manual techniques since objects are reclaimed automatically.","PeriodicalId":185981,"journal":{"name":"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing","volume":"156 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114799277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Construction of Directed Hopsets and Parallel Single-source Shortest Paths (Abstract)","authors":"Nairen Cao, Jeremy T. Fineman, Katina Russell","doi":"10.1145/3597635.3598019","DOIUrl":"https://doi.org/10.1145/3597635.3598019","url":null,"abstract":"The single-source shortest-path problem is as follows: given a graph with nonnegative edge weights and a designated source vertex s, return the distances from s to every other vertex. This paper presents a randomized parallel single-source shortest paths (SSSP) algorithm for directed graphs with non-negative integer edge weights that solves the exact SSSP problem in O(m) work and n^{1/2+o(1)} span, with high probability. All previous exact SSSP algorithms with nearly linear work have linear span, even for undirected unweighted graphs. To solve the exact SSSP problem, we first show a deterministic reduction from exact SSSP to directed hopsets using the iterative gradual rounding technique. A (β, ε)-hopset is a set of weighted edges, also known as shortcuts, that when added to the graph, admit β-hop paths with weights no more than (1 + ε) times the true shortest path distances. We show that (β, ε)-hopsets can be used to solve the exact SSSP problem in O(m) work and O(β) span. Furthermore, we present the first nearly linear-work algorithm for constructing hopsets on directed graphs. Our sequential algorithm runs in O(m) time and constructs a hopset with O(n) edges and β = n^{1/2+o(1)}. We also provide a parallel version of the algorithm with O(m) work and n^{1/2+o(1)} span. The directed hopsets can be used to solve approximate SSSP problems efficiently, where the objective is to return estimates of the distances from the source vertex to every other vertex such that the estimate falls between the true distance and (1+ε) times the distance. Specifically, for constant ε and graphs with polynomially-bounded real edge weights, there is an algorithm solving the approximate SSSP problem with O(m) work and n^{1/2+o(1)} span.","PeriodicalId":185981,"journal":{"name":"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123777515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Taming Misaligned Graph Traversals in Concurrent Graph Processing (Abstract)","authors":"Xizhe Yin, Zhijia Zhao, Rajiv Gupta","doi":"10.1145/3597635.3598028","DOIUrl":"https://doi.org/10.1145/3597635.3598028","url":null,"abstract":"This work introduces Glign, a runtime system that automatically aligns the graph traversals for concurrent queries. Glign introduces three levels of graph traversal alignment for iterative evaluation of concurrent queries. First, it synchronizes the accesses of different queries to the active parts of the graph within each iteration of the evaluation---intra-iteration alignment. On top of that, Glign leverages a key insight regarding the \"heavy iterations\" in query evaluation to achieve inter-iteration alignment and alignment-aware batching. The former aligns the iterations of different queries to increase the graph access sharing, while the latter tries to group queries of better graph access sharing into the same evaluation batch. Together, these alignment techniques can substantially boost the data locality of concurrent query evaluation. Based on our experiments, Glign outperforms the state-of-the-art concurrent graph processing systems Krill and GraphM by 3.6× and 4.7× on average, respectively.","PeriodicalId":185981,"journal":{"name":"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122116978","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Strong Connectivity Based on Faster Reachability (Abstract)","authors":"Letong Wang, Xiaojun Dong, Yan Gu, Yihan Sun","doi":"10.1145/3597635.3598017","DOIUrl":"https://doi.org/10.1145/3597635.3598017","url":null,"abstract":"In this paper, we propose a parallel strongly connected components (SCC) implementation that is efficient on a wide range of graphs. Our speedup comes from two novel techniques: vertical granularity control (VGC) and parallel hash bag.","PeriodicalId":185981,"journal":{"name":"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133501065","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Parallel Algorithms for Euclidean Minimum Spanning Tree and Hierarchical Spatial Clustering (Abstract)","authors":"Yiqiu Wang, Shangdi Yu, Yan Gu, Julian Shun","doi":"10.1145/3597635.3598025","DOIUrl":"https://doi.org/10.1145/3597635.3598025","url":null,"abstract":"This paper presents new parallel algorithms for generating Euclidean minimum spanning trees and spatial clustering hierarchies (known as HDBSCAN^*). Our approach is based on generating a well-separated pair decomposition followed by using Kruskal's minimum spanning tree algorithm and bichromatic closest pair computations. We introduce a new notion of well-separation to reduce the work and space of our algorithm for HDBSCAN^*. We also give a new parallel divide-and-conquer algorithm for computing the dendrogram and reachability plots, which are used in visualizing clusters of different scale that arise for both EMST and HDBSCAN^*. We show that our algorithms are theoretically efficient: they have work (number of operations) matching their sequential counterparts, and polylogarithmic depth (parallel time). We implement our algorithms and propose a memory optimization that requires only a subset of well-separated pairs to be computed and materialized, leading to savings in both space (up to 10x) and time (up to 8x). Our experiments on large real-world and synthetic data sets using a 48-core machine show that our fastest algorithms outperform the best serial algorithms for the problems by 11.13--55.89x, and existing parallel algorithms by at least an order of magnitude.","PeriodicalId":185981,"journal":{"name":"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125118289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Static Prediction of Parallel Computation Graphs (Abstract)","authors":"Stefan K. Muller","doi":"10.1145/3597635.3598026","DOIUrl":"https://doi.org/10.1145/3597635.3598026","url":null,"abstract":"Many results in the theory of parallel scheduling, dating back to Brent's Theorem, are expressed in terms of the parallel dependency structure of a program as represented by a Directed Acyclic Graph (DAG). In the world of parallel and concurrent program analysis, such DAG models are also used to study deadlock, data races, and priority inversions, to name just a few examples. In all of these cases, it tends to be convenient to think of the DAG as a model of the program itself: we might say, for example, that the time to run a parallel program on P processors depends on the work and span of the program's DAG. This assumes that the DAG is a static, predictable property of the program. In reality, however, a DAG typically models the runtime relationships between threads during a particular execution of a program. To obtain the DAG, one might simulate an execution (or all possible executions) using some form of cost semantics, a dynamic semantics that produces the DAG as it executes the program. In fine-grained parallel programs, such as those that result from constructs such as fork/join, spawn/sync, async/finish, and futures, these DAGs tend to be especially dynamic and dependent on the features of a particular execution. For example, a divide-and-conquer algorithm implemented using fork/join parallelism may divide a certain number of times depending on the input size, and a program written with futures can choose to wait on threads or not wait on threads depending on conditions available only at runtime. Such programs are best represented by a (possibly infinite) family of DAGs, representing all possible executions of the program.","PeriodicalId":185981,"journal":{"name":"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130258529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CommonGraph: Graph Analytics on Evolving Data (Abstract)","authors":"Mahbod Afarin, Chao Gao, Shafiur Rahman, Nael B. Abu-Ghazaleh, Rajiv Gupta","doi":"10.1145/3597635.3598022","DOIUrl":"https://doi.org/10.1145/3597635.3598022","url":null,"abstract":"We consider the problem of graph analytics on evolving graphs. In this scenario, a query typically needs to be applied to different snapshots of the graph over an extended time window. We propose CommonGraph, an approach for efficient processing of queries on evolving graphs. We first observe that edge deletions are significantly more expensive than addition operations. CommonGraph converts all deletions to additions by finding a common graph that exists across all snapshots. After computing the query on this graph, to reach any snapshot, we simply need to add the missing edges and incrementally update the query results. CommonGraph also allows sharing of common additions among snapshots that require them, and breaks the sequential dependency inherent in the traditional streaming approach where snapshots are processed in sequence, enabling additional opportunities for parallelism. We incorporate the CommonGraph approach by extending the KickStarter streaming framework. CommonGraph achieves 1.38x-8.17x improvement in performance over Kickstarter across multiple benchmarks.","PeriodicalId":185981,"journal":{"name":"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132643537","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Sparse Data Orchestration via Dynamic Reflexive Tiling (Extended Abstract)","authors":"Toluwanimi O. Odemuyiwa, Hadi Asghari-Moghaddam, Michael Pellauer, Kartik Hegde, Po-An Tsai, N. Crago, A. Jaleel, J. Owens, Edgar Solomonik, J. Emer, Christopher W. Fletcher","doi":"10.1145/3597635.3598031","DOIUrl":"https://doi.org/10.1145/3597635.3598031","url":null,"abstract":"Tensor algebra involving multiple sparse operands is severely memory bound, making it a challenging target for acceleration. Furthermore, irregular sparsity complicates traditional techniques---such as tiling---for ameliorating memory bottlenecks. Prior sparse tiling schemes are sparsity unaware: they carve tensors into uniform coordinate-space shapes, which leads to low-occupancy tiles and thus lower exploitable reuse. To address these challenges, this paper proposes dynamic reflexive tiling (DRT), a novel tiling method that improves data reuse over prior art for sparse tensor kernels, unlocking significant performance improvement opportunities. DRT's key idea is dynamic sparsity-aware tiling. DRT continuously re-tiles sparse tensors at runtime based on the current sparsity of the active regions of all input tensors, to maximize accelerator buffer utilization while retaining the ability to co-iterate through tiles of distinct tensors. Through an extensive evaluation over a set of SuiteSparse matrices, we show how DRT can be applied to multiple prior accelerators with different dataflows (ExTensor, OuterSPACE, MatRaptor), improving their performance (by 3.3x, 5.1x, and 1.6x, respectively) while adding negligible area overhead. We apply DRT to higher-order tensor kernels to reduce DRAM traffic by 3.9x and 16.9x over a CPU implementation and prior-art tiling scheme, respectively. Finally, we show that the technique is portable to software, with an improvement of 7.29x and 2.94x in memory overhead compared to untiled sparse-sparse matrix multiplication (SpMSpM).","PeriodicalId":185981,"journal":{"name":"Proceedings of the 2023 ACM Workshop on Highlights of Parallel Computing","volume":"95 Suppl A 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2023-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116920170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}