{"title":"A Locality-aware Cooperative Distributed Memory Caching for Parallel Data Analytic Applications","authors":"Chia–Ting Hung, J. Chou, Ming-Hung Chen, I. Chung","doi":"10.1109/IPDPSW55747.2022.00183","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00183","url":null,"abstract":"Memory caching has long been used to fill the performance gap between processor and disk by reducing the data access time of data-intensive computations. Previous studies on caching mostly focus on optimizing the hit rate of a single machine. In this paper, however, we argue that the caching decisions of a distributed memory system should be made cooperatively for parallel data analytic applications, which are commonly used by emerging technologies, such as Big Data and AI (Artificial Intelligence), to perform data mining and sophisticated analytics on larger data volumes in a shorter time. A parallel data analytic job consists of multiple parallel tasks. Hence, the completion time of a job is bounded by its slowest task, meaning that the job cannot benefit from caching until the inputs of all its tasks are cached. To address this problem, we propose a cooperative caching design that periodically rearranges the cache placement among nodes according to the data access pattern while taking task dependency and network locality into account. Our approach is evaluated by a trace-driven simulator using both synthetic workloads and real-world traces. The results show that we can reduce the average completion time by up to 33% compared to non-collaborative caching policies and by 25% compared to other state-of-the-art collaborative caching policies.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121684445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CORtEX 2022 Invited Speaker 4: Large-scale simulations of mammalian brains using peta- to exa-scale computing","authors":"J. Igarashi","doi":"10.1109/IPDPSW55747.2022.00216","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00216","url":null,"abstract":"A whole-brain simulation allows us to investigate all interactions among neurons in the brain to understand the mechanisms of information processing and brain diseases. The computational performance of exascale supercomputers in the 2020s is estimated to enable whole-brain simulation at a human scale. However, simulations that sufficiently reproduce and predict the neural behavior and functionality of the whole brain have not yet been realized, due to the lack of computational resources, physiological and anatomical data, brain models, and neural network simulators. We have studied large-scale brain simulations on various supercomputers as steps toward whole-brain simulation. In this talk, we will introduce studies on developing efficient spiking neural simulators, modeling brain disease, and large-scale simulations of the cortico-cerebello-thalamic circuit using the supercomputer Fugaku.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121695712","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed Algorithms for the Graph Biconnectivity and Least Common Ancestor Problems","authors":"Ian Bogle, George M. Slota","doi":"10.1109/IPDPSW55747.2022.00187","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00187","url":null,"abstract":"Graph connectivity analysis is one of the primary ways to analyze the topological structure of social networks. Graph biconnectivity decompositions are of particular interest due to how they identify cut vertices and cut edges in a network. We present the first, to our knowledge, implementation of a distributed-memory parallel biconnectivity algorithm. As part of our algorithm, we also require the computation of least common ancestors (LCAs) of non-tree edge endpoints in a BFS tree. As such, we also propose a novel distributed algorithm for the LCA problem. Using our implementations, we observe up to a 14.8× speedup from 1 to 128 MPI ranks for computing a biconnectivity decomposition.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"8 Pt 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126270743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Power Consumption of Lossy Compressed I/O for Exascale HPC Systems","authors":"Grant Wilkins, Jon C. Calhoun","doi":"10.1109/IPDPSW55747.2022.00184","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00184","url":null,"abstract":"Exascale computing enables unprecedented, detailed and coupled scientific simulations which generate data on the order of tens of petabytes. Due to large data volumes, lossy compressors become indispensable as they enable better compression ratios and runtime performance than lossless compressors. Moreover, as high-performance computing (HPC) systems grow larger, they draw power on the scale of tens of megawatts. Data motion is expensive in time and energy. Therefore, optimizing compressor and data I/O power usage is an important step in reducing energy consumption to meet sustainable computing goals and stay within limited power budgets. In this paper, we explore power consumption gains for the SZ and ZFP lossy compressors and for data writing on a cloud HPC system while varying the CPU frequency, scientific data sets, and system architecture. Using this power consumption data, we construct a power model for lossy compression and present a tuning methodology that reduces the energy overhead of lossy compressors and data writing on HPC systems by 14.3% on average. Applying our model, we find 6.5 kJ, or 13%, of energy savings on average for 512 GB of I/O. Therefore, utilizing our model results in more energy-efficient lossy data compression and I/O.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126935991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Methodology to Build Decision Analysis Tools Applied to Distributed Reinforcement Learning","authors":"Cèdric Prigent, Loïc Cudennec, Alexandru Costan, Gabriel Antoniu","doi":"10.1109/IPDPSW55747.2022.00173","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00173","url":null,"abstract":"As Artificial Intelligence-based applications become more and more complex, speeding up the learning phase (which is typically computation-intensive) becomes more and more necessary. Distributed machine learning (ML) appears adequate to address this problem. Unfortunately, ML also brings new development frameworks, methodologies and high-level programming languages that do not fit the regular high-performance computing design flow. This paper introduces a methodology to build a decision-making tool that allows ML experts to arbitrate between different frameworks and deployment configurations, in order to fulfill project objectives such as the accuracy of the resulting model, the computing speed or the energy consumption of the learning computation. The proposed methodology is applied to an industrial-grade case study in which reinforcement learning is used to train an autonomous steering model for a cargo airdrop system. Results are presented within a Pareto front that lets ML experts choose an appropriate solution, a framework and a deployment configuration, based on the current operational situation. While the proposed approach can effortlessly be applied to other machine learning problems, as for many decision-making systems, the selected solutions involve a trade-off between several antagonistic evaluation criteria and require experts from different domains to pick the most efficient solution from the short list. Nevertheless, this methodology speeds up the development process by clearly discarding, or, on the contrary, including combinations of frameworks and configurations, which has a significant impact for time- and budget-constrained projects.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125910431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Customizable Lightweight STM for Irregular Algorithms on GPU","authors":"Shayan Manoochehri, Patrick Cristofaro, D. Goswami","doi":"10.1109/IPDPSW55747.2022.00098","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00098","url":null,"abstract":"Irregular algorithms are often encountered in highly data-centric application domains. These algorithms operate on irregular data structures such as sparse graphs with irregular access patterns, which may also modify the underlying topology unpredictably. High computational time and the inherent data parallelism present in these algorithms motivate the use of GPUs for speeding things up; however, their efficient implementation is challenging due to: the difficulty of protecting shared data consistency in the presence of concurrent dynamic transactions; irregular access patterns due to unstructured data structures; and dynamic structural modifications of the underlying topology. One approach to overcome these challenges is to use Software Transactional Memory (STM). However, the overly complex design and implementations of contemporary STM-based approaches and the lack of a proper framework to employ them in conjunction with irregular algorithms stall their adoption by the programming community. To overcome some of these challenges, this research proposes a lightweight STM with a simple design (Lite GSTM), based on a lock stealing algorithm, and an associated extensible framework to hide the complexity of the STM from the programmer. The framework is extensible by allowing plug-ins of customized STMs designed for the different needs of transactions. The use of the framework is elaborated with two use cases which employ completely different irregular algorithms but share some common features: the underlying data structure is a graph, and the graph is structurally modified (coarsened) unpredictably in the course of execution. The paper presents performance comparisons of the STM-based implementations with respect to their sequential and non-STM-based counterparts, which show promising results.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126558746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatic Parallelization of Programs via Software Stream Rewriting","authors":"Tao Tao, D. Plaisted","doi":"10.1109/IPDPSW55747.2022.00094","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00094","url":null,"abstract":"We introduce a system for automatically parallelizing programs using a parallel-by-default language based on stream rewriting. Our method is general and supports all programs that can be written in a typical high-level, imperative language. The technique is fine-grained and fully automatic. It requires no programmer annotation, static analysis, runtime profiling, or cutoff schemes. The only assumption is that all function arguments in the input program can be executed in parallel. This does not affect the generality of our system since programmers can write sequential parts in continuation-passing style. Experiments show that the runtime can scale computation-bound programs up to 16 cores without performance degradation. Future work remains to improve key aspects of the runtime and further increase the system's performance.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128130960","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a GraphBLAS Implementation for Go","authors":"Pascal Costanza, I. Hur, T. Mattson","doi":"10.1109/IPDPSW55747.2022.00052","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00052","url":null,"abstract":"The GraphBLAS are building blocks for constructing graph algorithms as linear algebra. They are defined mathematically with the goal that they would eventually map onto a variety of programming languages. Today they exist in C, C++, Python, MATLAB®, and Julia. In this paper, we describe the GraphBLAS for the Go programming language. A particularly interesting aspect of this work is that using the concurrency features of the Go language, we aim to build a runtime system that uses the GraphBLAS nonblocking mode by default.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128218219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ReconOS64: A Hardware Operating System for Modern Platform FPGAs with 64-Bit Support","authors":"L. Clausing, M. Platzner","doi":"10.1109/IPDPSW55747.2022.00029","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00029","url":null,"abstract":"Reconfigurable hardware operating systems provide software-like abstractions for hardware accelerators. In particular, abstractions that view hardware accelerators as threads and integrate them into a multi-threaded environment have gained popularity. However, such abstractions are not yet available for the latest platform FPGAs. In this paper, we present ReconOS64, a reconfigurable hardware operating system for modern 64-bit platform FPGAs. We discuss the architecture and the build flow and report on a number of experiments that evaluate the performance of the system. In particular, we compare the performance to a previous, 32-bit ReconOS system. The evaluation shows that the step towards 64-bit is not only necessary to make hardware operating system support available for modern platform FPGAs, but also improves the performance of operating system calls and memory accesses for hardware threads.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127263776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Modeling Memory Contention between Communications and Computations in Distributed HPC Systems","authors":"Alexandre Denis, E. Jeannot, Philippe Swartvagher","doi":"10.1109/IPDPSW55747.2022.00086","DOIUrl":"https://doi.org/10.1109/IPDPSW55747.2022.00086","url":null,"abstract":"To amortize the cost of MPI communications, distributed parallel HPC applications can overlap network communications with computations in the hope of improving global application performance. When using this technique, computations and communications run at the same time. But computation usually also performs some data movement. Since data for computations and for communications use the same memory system, memory contention may occur when computations are memory-bound and large messages are transmitted through the network at the same time. In this paper we propose a model to predict memory bandwidth for computations and for communications when they are executed side by side, according to data locality and taking contention into account. Elaborating the model allowed us to better understand the location of bottlenecks in the memory system and the strategies the memory system employs under contention. The model was evaluated on many platforms with different characteristics, and showed an average prediction error below 4%.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"83 ","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133784500","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}