{"title":"Compressed In-memory Graphs for Accelerating GPU-based Analytics","authors":"Noushin Azami, Martin Burtscher","doi":"10.1109/IA356718.2022.00011","DOIUrl":"https://doi.org/10.1109/IA356718.2022.00011","url":null,"abstract":"Processing large graphs has become an important irregular workload. We present Massively Parallel Log Graphs (MPLG) to accelerate GPU graph codes, including highly optimized codes. MPLG combines a compressed in-memory representation with low-overhead parallel decompression. This yields a speedup if the boost in memory performance due to the reduced footprint outweighs the overhead of the extra instructions to decompress the graph on the fly. However, achieving a sufficiently low overhead is difficult, especially on GPUs with their high-bandwidth memory. Prior work has only successfully employed similar ideas on CPUs, but those approaches exhibit limited parallelism, making them unsuitable for GPUs. On large real-world inputs, MPLG speeds up graph analytics by up to 67% on a Titan V GPU. Averaged over 15 graphs from several domains, it improves the performance of Rodinia's breadth-first search by 11.9%, Gardenia's connected components by 5.8%, and ECL's graph coloring by 5.0%.","PeriodicalId":144759,"journal":{"name":"2022 IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128732202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Message from the IA3 22 Workshop Chairs","authors":"","doi":"10.1109/ia356718.2022.00004","DOIUrl":"https://doi.org/10.1109/ia356718.2022.00004","url":null,"abstract":"","PeriodicalId":144759,"journal":{"name":"2022 IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"97 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127090441","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Blocking Sparse Matrices to Leverage Dense-Specific Multiplication","authors":"P. S. Labini, M. Bernaschi, W. Nutt, Francesco Silvestri, Flavio Vella","doi":"10.1109/IA356718.2022.00009","DOIUrl":"https://doi.org/10.1109/IA356718.2022.00009","url":null,"abstract":"Research to accelerate matrix multiplication, pushed by the growing computational demands of deep learning, has sprouted many efficient architectural solutions, such as NVIDIA's Tensor Cores. These accelerators are designed to process efficiently a high volume of small dense matrix products in parallel. However, it is not obvious how to leverage these accelerators for sparse matrix multiplication. A natural way to adapt the accelerators to this problem is to divide the matrix into small blocks, and then multiply only the nonzero blocks. In this paper, we investigate ways to reorder the rows of a sparse matrix to reduce the number of nonzero blocks and cluster the nonzero elements into a few dense blocks. While this pre-processing can be computationally expensive, we show that the high speed-up provided by the accelerators can easily repay the cost, especially when several multiplications follow one reordering.","PeriodicalId":144759,"journal":{"name":"2022 IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"14 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134150843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Page-Address Coalescing of Vector Gather Instructions for Efficient Address Translation","authors":"Hikaru Takayashiki, Masayuki Sato, K. Komatsu, Hiroaki Kobayashi","doi":"10.1109/IA356718.2022.00007","DOIUrl":"https://doi.org/10.1109/IA356718.2022.00007","url":null,"abstract":"Vector gather instructions are available in various processors, which are essential for handling irregular memory accesses. Additionally, the processors support virtual memory that allows programmers not to consider the limitation of the physical memory space. To realize the virtual memory, the processors require address translation between virtual and physical addresses. When a vector gather instruction loads data elements distributed over the physical memory space, all virtual addresses must be translated one by one, causing many translations by accessing a Translation Lookaside Buffer (TLB). Hence, the TLB easily becomes a bottleneck in handling vector gather instructions. To relieve the bottleneck, this paper proposes an address coalescing method for the address translations of vector gather instructions by utilizing vector arithmetic units in the processor. The evaluation results show that the proposed method can achieve a 2x performance improvement in numerical and 1.08x in graph applications, which contain many vector gather instructions.","PeriodicalId":144759,"journal":{"name":"2022 IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131393801","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SparseLU, A Novel Algorithm and Math Library for Sparse LU Factorization","authors":"Pedro Valero-Lara, Cameron Greenwalt, J. Vetter","doi":"10.1109/IA356718.2022.00010","DOIUrl":"https://doi.org/10.1109/IA356718.2022.00010","url":null,"abstract":"Decomposing sparse matrices into lower and upper triangular matrices (sparse LU factorization) is a key operation in many computational scientific applications. We developed SparseLU, a sparse linear algebra library that implements a new algorithm for LU factorization on general sparse matrices. The new algorithm divides the input matrix into tiles to which OpenMP tasks are created for factorization computation, where only tiles that contain nonzero elements are computed. For comparative performance analysis, we used the reference library SuperLU. Testing was performed on synthetically generated matrices which replicate the conditions of the real-world matrices. SparseLU is able to reach a mean speedup of ~29× compared to SuperLU.","PeriodicalId":144759,"journal":{"name":"2022 IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"1198 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130603754","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Datalog applications with cuDF","authors":"Ahmedur Rahman Shovon, Landon Dyken, Oded Green, Thomas Gilray, Sidharth Kumar","doi":"10.1109/IA356718.2022.00012","DOIUrl":"https://doi.org/10.1109/IA356718.2022.00012","url":null,"abstract":"Datalog, a bottom-up declarative logic programming language, has a wide variety of uses for deduction, modeling, and data analysis, across application domains. Datalog can be efficiently implemented using relational algebra primitives such as join, projection and union. While there exist several multi-threaded and multi-core implementations of Datalog, targeting CPU-based systems, our work makes an inroad towards developing a Datalog implementation for GPUs. We demonstrate the feasibility of a high-performance relational algebra backend for a subset of Datalog applications that can effectively leverage the parallelism of GPUs using cuDF. cuDF is a library from the Rapids suite that uses the NVIDIA CUDA programming model for GPU parallelism. It provides similar functionalities to Pandas, a popular data analysis engine. In this paper, we analyze and evaluate the performance of cuDF versus Pandas for two graph-mining problems implemented in Datalog, (1) triangle counting and (2) transitive-closure computation.","PeriodicalId":144759,"journal":{"name":"2022 IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"179 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116666955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Evolution of a New Model of Computation","authors":"Brian A. Page, P. Kogge","doi":"10.1109/IA356718.2022.00008","DOIUrl":"https://doi.org/10.1109/IA356718.2022.00008","url":null,"abstract":"The conventional model of parallel programming today involves either copying data across cores (and then having to track its most recent value), or not copying and requiring deep software stacks to perform even the simplest operation on data that is “remote”, i.e., out of the range of loads and stores from the current core. As application requirements grow to larger data sets, with more irregular access to them, both conventional approaches start to exhibit severe scaling limitations. This paper reviews some growing evidence of the potential value of a new model of computation that skirts between the two: data does not move (i.e., is not copied), but computation instead moves to the data. Several different applications involving large sparse computations, streaming of data, and complex mixed mode operations have been coded for a novel platform where thread movement is handled invisibly by the hardware. The evidence to date indicates that parallel scaling for this paradigm can be significantly better than any mix of conventional models.","PeriodicalId":144759,"journal":{"name":"2022 IEEE/ACM Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114394344","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}