{"title":"No More Leaky PageRank","authors":"Scott Sallinen, M. Ripeanu","doi":"10.1109/IA354616.2021.00011","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00011","url":null,"abstract":"We have surveyed multiple PageRank implementations available with popular graph processing frameworks, and discovered that they treat sink vertices (i.e., vertices without outgoing edges) incorrectly. This leads to two issues: (i) incorrect PageRank scores, and (ii) flawed performance evaluations (as costly scatter operations are avoided). For synchronous PageRank implementations, a strategy to fix these issues exists (accumulating all values from sinks during an algorithmic superstep of a PageRank iteration), albeit with sizeable overhead. This solution, however, is not applicable in the context of asynchronous frameworks. We present and evaluate a novel, low-cost algorithmic solution to address this issue. For asynchronous PageRank, our key target, our solution simply requires an inexpensive O(Vertex) computation performed alongside the final normalization step. We also show that this strategy has advantages over prior work for synchronous PageRank, as it both avoids graph restructuring and reduces inline computation costs by performing a final score reassignment to vertices once at the end of processing.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129485671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Proceedings of IA3 2021: Workshop on Irregular Applications: Architectures and Algorithms [Title page]","authors":"","doi":"10.1109/ia354616.2021.00001","DOIUrl":"https://doi.org/10.1109/ia354616.2021.00001","url":null,"abstract":"","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124112197","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating unstructured-grid CFD algorithms on NVIDIA and AMD GPUs","authors":"C. Stone, Aaron C. Walden, M. Zubair, E. Nielsen","doi":"10.1109/IA354616.2021.00010","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00010","url":null,"abstract":"Computational performance of the FUN3D unstructured-grid computational fluid dynamics (CFD) application on GPUs is highly dependent on the efficiency of floating-point atomic updates needed to support the irregular cell-, edge-, and node-based data access patterns in massively parallel GPU environments. We examine several optimization methods to improve GPU efficiency of performance-critical kernels that are dominated by atomic update costs on NVIDIA V100/A100 and AMD CDNA MI100 GPUs. Optimization on the AMD MI100 GPU was of primary interest since similar hardware will be used in the upcoming Frontier supercomputer. Techniques combining register shuffling and on-chip shared memory were used to transpose and/or aggregate results amongst collaborating GPU threads before atomically updating global memory. These techniques, along with algorithmic optimizations to reduce the update frequency, reduced the run-time of select kernels on the MI100 GPU by a factor of between 2.5 and 6.0 over atomically updating global memory directly. Performance impact on the NVIDIA GPUs was mixed: the V100 was often degraded when using register-based aggregation/transposition techniques, while the A100 generally benefited from these methods, though to a lesser extent than measured on the MI100 GPU. Overall, both V100 and A100 GPUs outperformed the MI100 GPU on kernels dominated by double-precision atomic updates; however, the techniques demonstrated here reduced the performance gap and improved the MI100 performance.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129996719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Greatly Accelerated Scaling of Streaming Problems with A Migrating Thread Architecture","authors":"Brian A. Page, P. Kogge","doi":"10.1109/IA354616.2021.00009","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00009","url":null,"abstract":"Applications where continuous streams of data are passed through large data structures are becoming of increasing importance. However, their execution on conventional architectures, especially when parallelism is desired to boost performance, is highly inefficient. The primary issue is often with the need to stream large numbers of disparate data items through the equivalent of very large hash tables distributed across many nodes. This paper builds on some prior work on the Firehose streaming benchmark, where an emerging architecture using threads that can migrate through memory has been shown to be much more efficient at such problems. This paper extends that work to use a second-generation system to not only show that same improved efficiency (10X) for larger core counts, but even significantly higher raw performance (with FPGA-based cores running at 1/10th the clock of conventional systems). Further, this additional data yields insight into which resources represent the bottlenecks to even more performance, and makes a reasonable projection that implementation of such an architecture with current technology would lead to a 10X performance gain on an apples-to-apples basis with conventional systems.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125511991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sparse Exact Factorization Update","authors":"Jinhao Chen, T. Davis, Christopher Lourenco, Erick Moreno-Centeno","doi":"10.1109/IA354616.2021.00012","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00012","url":null,"abstract":"To meet the growing need for extended or exact precision solvers, an efficient framework based on Integer-Preserving Gaussian Elimination (IPGE) has been recently developed which includes dense/sparse LU/Cholesky factorizations and dense LU/Cholesky factorization updates for column and/or row replacement. In this paper, we discuss our on-going work developing the sparse LU/Cholesky column/row-replacement update and the sparse rank-1 update/downdate. We first present some basic background for the exact factorization framework based on IPGE. Then we give our proposed algorithms along with some implementation and data-structure details. Finally, we provide some experimental results showcasing the performance of our update algorithms. Specifically, we show that updating these exact factorizations can typically be 10x to 100x faster than (re-)factorizing the matrices from scratch.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132368924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Scalable Data Processing in Python with CLIPPy","authors":"P. Pirkelbauer, Seth Bromberger, Keita Iwabuchi, R. Pearce","doi":"10.1109/IA354616.2021.00013","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00013","url":null,"abstract":"The Python programming language has become a popular choice for data scientists. While easy to use, the Python language is not well suited to drive data science on large-scale systems. This paper presents a first prototype of CLIPPy (Command line interface plus Python), a user-side class in Python that connects to high-performance computing environments with nonvolatile memory (NVM). CLIPPy queries available executable files and prepares a Python API on the fly. The executables can connect to a backend that executes on a large-scale system. The executables can be implemented in any language, for example in C++. CLIPPy and the executables are loosely coupled and communicate through a JSON based interface. By storing data in NVM, executables can attach and detach to data structures without expensive format conversions. The underlying philosophy, design challenges, and a prototype implementation that accesses data stored in non-volatile memory will be discussed.","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131017149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mapping Irregular Computations for Molecular Docking to the SX-Aurora TSUBASA Vector Engine","authors":"Leonardo Solis-Vasquez, E. Focht, Andreas Koch","doi":"10.1109/IA354616.2021.00008","DOIUrl":"https://doi.org/10.1109/IA354616.2021.00008","url":null,"abstract":"Molecular docking is a key method in computer-aided drug design, where the rapid identification of drug candidates is crucial for combating diseases. AutoDock is a widely-used molecular docking program, having an irregular structure characterized by a divergent control flow and compute-intensive calculations. This work investigates porting AutoDock to the SX-Aurora TSUBASA vector engine and evaluates the achievable performance on a number of real-world input compounds. In particular, we discuss the platform-specific coding styles required to handle the high degree of irregularity in both local-search methods employed by AutoDock. These Solis-Wets and ADADELTA methods take up a large part of the total computation time. Based on our experiments, we achieved runtimes on the SX-Aurora TSUBASA VE 20B that are on average 3x faster than on modern dual-socket 64-core CPU nodes. Our solution is competitive with V100 GPUs, even though these already use newer chip fabrication technology (12 nm vs. 16 nm on the VE 20B).","PeriodicalId":415158,"journal":{"name":"2021 IEEE/ACM 11th Workshop on Irregular Applications: Architectures and Algorithms (IA3)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-11-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131346018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}