{"title":"A Hybrid Scheduling Scheme for Parallel Loops","authors":"A. Handleman, Arthur G. Rattew, I. Lee, T. Schardl","doi":"10.1109/IPDPS49936.2021.00067","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00067","url":null,"abstract":"Parallel loops are commonly used constructs for parallelizing high-performance scientific applications. In the paradigm of task parallelism, the parallel loop construct expresses the logical parallelism of the loop: it indicates that the iterations of a loop are logically parallel and lets an underlying runtime scheduler determine how best to map the parallel iterations onto the available processing cores. Researchers have investigated multiple schemes for scheduling parallel loops, with static partitioning and dynamic partitioning being the most prevalent. Static partitioning incurs low scheduling overhead while potentially retaining locality benefits in iterative applications that perform a sequence of parallel loops that access the same set of data repeatedly. But static partitioning may perform poorly relative to dynamic partitioning if the loop iterations contain unbalanced workloads or if the cores arrive at the loops at different times. We propose a hybrid scheduling scheme, which first schedules loops using static partitioning but then employs dynamic partitioning when load balancing is necessary. Moreover, the work distribution employs a claiming heuristic that allows a core to check for partitions to work on in a semi-deterministic fashion, allowing the scheduling to better retain data locality in the case of iterative applications. Unlike prior work that optimizes for iterative applications, our scheme does not require programmer annotations and can provide provably efficient execution time. In this paper, we discuss the hybrid scheme, prove its correctness, and analyze its scheduling bound. 
We have also implemented the proposed scheme in a Cilk-based work-stealing platform and experimentally verified that the scheme load balances well and can retain locality for such iterative applications.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130407207","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing Small-Scale Matrix Multiplications on ARMv8-based Many-Core Architectures","authors":"Weiling Yang, Jianbin Fang, Dezun Dong","doi":"10.1109/IPDPS49936.2021.00019","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00019","url":null,"abstract":"General Matrix Multiplication (GEMM) is a key subroutine in high-performance computing. There is a large body of work on evaluating and optimizing large-scale matrix multiplication, but how well small-scale matrix multiplication (SMM) performs is largely unknown, especially on ARMv8-based many-core architectures. In this work, we evaluate and characterize the performance of SMM subroutines on Phytium 2000+, an ARMv8-based 64-core architecture. The evaluation is performed extensively with mainstream open-source libraries including OpenBLAS, BLIS, BLASFEO, and Eigen. Under various experimental settings, we observe how well the small-scale GEMM routines perform on Phytium 2000+, and then discuss the factors behind the performance behaviours of SMM. Building on this basis, we shed light on the performance bottlenecks of and practical optimizations for SMM from various angles: (1) mitigating the data packing overhead, (2) processing the edge cases properly, (3) selecting a suitable micro-kernel, and (4) adopting an appropriate parallelization method. 
Our results help users develop efficient SMM optimizations on ARMv8-based many-core architectures and embed them into real-world applications.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129074101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Internet-Scale Convolutional Root-Cause Analysis with DIAGNET","authors":"Loïck Bonniot, C. Neumann, François Taïani","doi":"10.1109/IPDPS49936.2021.00084","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00084","url":null,"abstract":"Diagnosing problems in Internet-scale services remains particularly difficult and costly for both content providers and ISPs. Because the Internet is decentralized, the cause of such problems might lie anywhere between a user’s device and the datacenters hosting the service. Further, the set of possible problems and causes is not known in advance, making it impossible in practice to train a classifier with all combinations of problems, causes and locations. In this paper, we explore how machine learning techniques can be used for Internet-scale root cause analysis based on measurements taken from end-user devices. Using convolutional neural networks, we show how to build generic models that (i) are agnostic to the underlying network topology, (ii) do not require defining the full set of possible causes during training, and (iii) can be quickly adapted to diagnose new services. We evaluate our proposal, DIAGNET, on a geodistributed multi-cloud deployment of online services, using a combination of fault injection and emulated clients running within automated browsers. 
Our experiments demonstrate the promising capabilities of our technique, delivering a recall of 73.9%, including on causes that were unknown at training time.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129066684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ARBALEST: Dynamic Detection of Data Mapping Issues in Heterogeneous OpenMP Applications","authors":"Lechen Yu, Joachim Protze, Oscar R. Hernandez, Vivek Sarkar","doi":"10.1109/IPDPS49936.2021.00055","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00055","url":null,"abstract":"From OpenMP 4.0 onwards, programmers can offload code regions to accelerators by using the target offloading feature. However, incorrect usage of target offloading constructs may incur data mapping issues. A data mapping issue occurs when the host fails to observe updates on the accelerator or vice versa. It may further lead to multiple memory issues such as use of uninitialized memory, use of stale data, and data races. To the best of our knowledge, there is no prior work on dynamic detection of data mapping issues in heterogeneous OpenMP applications. In this paper, we identify possible root causes of data mapping issues in OpenMP’s standard memory model and the unified memory model. We find that data mapping issues primarily result from incorrect settings of map and nowait clauses in target offloading constructs. Further, the novel unified memory model introduced in OpenMP 5.0 cannot avoid the occurrence of data mapping issues. To mitigate the difficulty of detecting data mapping issues, we propose ARBALEST, an on-the-fly data mapping issue detector for OpenMP applications. For each variable mapped to the accelerator, ARBALEST’s detection algorithm leverages a state machine to track the last write’s visibility. ARBALEST requires constant storage space for each memory location and takes amortized constant time per memory access. To demonstrate ARBALEST’s effectiveness, an experimental comparison with four other dynamic analysis tools (Valgrind, Archer, AddressSanitizer, MemorySanitizer) has been carried out on a number of open-source benchmark suites. 
The evaluation results show that ARBALEST delivers demonstrably better precision than the other four tools, and its execution time overhead is comparable to that of state-of-the-art dynamic analysis tools.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122516543","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Combining XOR and Partner Checkpointing for Resilient Multilevel Checkpoint/Restart","authors":"Masoud Gholami, F. Schintke","doi":"10.1109/IPDPS49936.2021.00036","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00036","url":null,"abstract":"Checkpoint/restart (C/R) makes large-scale parallel jobs resilient against multiple node failures but typically takes considerable time and storage space. Efficient C/R strategies try to gain high levels of fault-tolerance while keeping the involved I/O and computation low. By combining XOR and partner checkpointing, two relatively weak C/R strategies, we develop and evaluate a stable, scalable, and fast C/R approach (including initialization, checkpointing, version consensus, and recovery mechanisms) that outperforms other C/R methods such as Reed-Solomon checkpointing in terms of stability and performance.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115556054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"NVMe-CR: A Scalable Ephemeral Storage Runtime for Checkpoint/Restart with NVMe-over-Fabrics","authors":"Shashank Gugnani, Tianxi Li, Xiaoyi Lu","doi":"10.1109/IPDPS49936.2021.00026","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00026","url":null,"abstract":"Emerging SSDs with NVMe-over-Fabrics (NVMf) support provide new opportunities to significantly improve the performance of IO-intensive HPC applications. However, state-of-the-art parallel filesystems cannot extract the best possible performance from fast NVMe SSDs and are not designed for latency-critical ephemeral IO tasks, such as checkpoint/restart. In this paper, we propose a powerful abstraction called microfs to peel away unnecessary software layers and eliminate namespace coordination. Building upon this abstraction, we present the design of NVMe-CR, a scalable ephemeral storage runtime for clusters with disaggregated compute and storage. NVMe-CR proposes techniques like metadata provenance, log record coalescing, and logically isolated shared device access, built around the microfs abstraction, to reduce the overhead of writing millions of concurrent checkpoint files. NVMe-CR utilizes high-density all-flash arrays accessible via NVMf to absorb bursty checkpoint IO and increase the progress rates of applications obliviously. 
Using the ECP CoMD application as a use case, results show that our runtime can achieve near perfect (> 0.96) efficiency at 448 processes and reduce checkpoint overhead by as much as 2x compared to state-of-the-art storage systems.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130778303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Interpreting Write Performance of Supercomputer I/O Systems with Regression Models","authors":"Bing Xie, Zilong Tan, P. Carns, J. Chase, K. Harms, J. Lofstead, S. Oral, Sudharshan S. Vazhkudai, Feiyi Wang","doi":"10.1109/IPDPS49936.2021.00064","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00064","url":null,"abstract":"This work seeks to advance the state of the art in HPC I/O performance analysis and interpretation. In particular, we demonstrate effective techniques to: (1) model output performance in the presence of I/O interference from production loads; (2) build features from write patterns and key parameters of the system architecture and configurations; (3) employ suitable machine learning algorithms to improve model accuracy. We train models with five popular regression algorithms and conduct experiments on two distinct production HPC platforms. We find that the lasso and random forest models predict output performance with high accuracy on both of the target systems. We also explore use of the models to guide adaptation in I/O middleware systems, and show potential for improvements of at least 15% from model-guided adaptation on 70% of samples, and improvements up to $10\\times$ on some samples for both of the target systems.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124438044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"SRNoC: A Statically-Scheduled Circuit-Switched Superconducting Race Logic NoC","authors":"George Michelogiannakis, D. Lyles, Patricia Gonzalez-Guerrero, M. Bautista, Dilip P. Vasudevan, Anastasiia Butko","doi":"10.1109/IPDPS49936.2021.00113","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00113","url":null,"abstract":"Temporal encoding has been shown to be a natural fit for single flux quantum (SFQ) superconducting computing since SFQ already encodes information with the presence or absence of voltage pulses. However, past work in SFQ has focused on binary-encoded networks on chip (NoCs). In this paper, we propose superconducting rotary NoC (SRNoC), a NoC where both data and control paths operate in the temporal domain following the race logic (RL) convention. Therefore, SFQ chips with temporal compute or memory can use SRNoC to avoid converting between the temporal and binary domains that would result from using a binary-encoded NoC. Using RL also enables SRNoC to be area-efficient, mitigating SFQ technology’s low device density. SRNoC treats pulses as independent packets and delivers them to outputs without changing their value, i.e. preserving the RL convention. SRNoC operates on a fixed, rotating connection schedule between inputs and outputs. In each connection window, multiple pulses (packets) can be transmitted sequentially. 
SRNoC provides $13.1\\times$ higher throughput per port per Josephson junction (JJ) compared to the best-performing of three demonstrated NoCs.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123420809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimizing Performance for Open-Channel SSDs in Cloud Storage System","authors":"Xiaoyi Zhang, Feng Zhu, Shu Li, Kun Wang, Wei Xu, Dengcai Xu","doi":"10.1109/IPDPS49936.2021.00099","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00099","url":null,"abstract":"In large-scale cloud storage systems, Solid-State Drives (SSDs) have been broadly used as the mainstream storage device because of their low access latency and high throughput. However, a conventional SSD is a black box to host software and thus fails to fully exploit the benefits of NAND flash and to provide high quality of service (QoS). On the other hand, the Open-Channel SSD (OCSSD), which exposes its internal information to the host software, has the potential to solve this problem. However, existing OCSSDs fail to achieve the anticipated performance under heavy workloads. To this end, we propose an advanced OCSSD-based driver developed with a novel data placement policy, redefined garbage collection (GC) with a copyback technique, an efficient prefetch read scheme, and a fast live upgrade method. Our work describes the consistent efforts to pursue high performance and QoS in OCSSDs with different approaches. The evaluation results show that our novel Open-Channel SSD is able to provide high I/O throughputs and predictable I/O latencies. 
For example, our Open-Channel SSD can improve I/O throughputs by 103% and reduce the 99th percentile latency by 62.9% on average compared with the state-of-the-art NVMe SSDs.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121286924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Covirt: Lightweight Fault Isolation and Resource Protection for Co-Kernels","authors":"Nicholas Gordon, J. Lange","doi":"10.1109/IPDPS49936.2021.00039","DOIUrl":"https://doi.org/10.1109/IPDPS49936.2021.00039","url":null,"abstract":"The challenges of the exascale era have generated a number of advancements in HPC systems software, with co-kernel architectures emerging as one such novel approach for HPC operating system and runtime (OS/R) design. Co-kernels function by running multiple specialized, lightweight OS kernels natively on the same host as a general purpose OS/R. These specialized kernels are able to provide optimized OS/R environments for HPC applications while still retaining access to the full feature set of the co-running general purpose OS/R. While co-kernels are able to effectively optimize for performance, they generally lack effective mechanisms for cross OS/R fault isolation and resource protection. In this paper we present Covirt, a lightweight OS/R protection layer that leverages the hardware virtualization features found on modern CPUs. Covirt interposes a minimal hypervisor layer between a co-kernel OS/R and hardware to prevent OS level faults from impacting other OS/Rs running on the same system. Covirt is different from other virtualization-based approaches due to the level of integration necessary between the co-kernel instances, requiring the support of higher level semantic interfaces between the different OS/Rs. Covirt features a split architecture consisting of a hypervisor and controller module that continuously monitors changes to the underlying resource partitioning and translates those events to hypervisor configuration changes. We have implemented a prototype of Covirt in the context of the Hobbes exascale OS/R stack, specifically targeting the Pisces co-kernel framework and Kitten Lightweight Kernel. 
Our evaluation shows that Covirt is able to add fault isolation for memory and interrupt processing with minimal performance overheads.","PeriodicalId":372234,"journal":{"name":"2021 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"169 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116398223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}