{"title":"Symmetric Tokens based Group Mutual Exclusion","authors":"A. Aravind","doi":"10.1145/3409390.3409395","DOIUrl":"https://doi.org/10.1145/3409390.3409395","url":null,"abstract":"The group mutual exclusion (GME) problem is a generalization of the mutual exclusion problem. The problem is fundamental to parallel and distributed processing, as it is inherent in several applications in the modern multicore-integrated cloud era of the distributed computing world. This paper proposes a First-Come-First-Served (FCFS) GME algorithm that only uses atomic read/write operations for n threads. The proposed algorithm has three key features: (i) its simplicity; (ii) it has complexity in both space (shared variable requirement) and time (remote memory references (RMR)) in cache coherent (CC) models; and (iii) it settles the open problem posed in 2001.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115081700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fast Modeling of Network Contention in Batch Point-to-point Communications by Packet-level Simulation with Dynamic Time-stepping","authors":"Zhang Yang, Jintao Peng, Qingkai Liu","doi":"10.1145/3409390.3409398","DOIUrl":"https://doi.org/10.1145/3409390.3409398","url":null,"abstract":"Network contention has long been one of the root causes of performance loss in large-scale parallel applications. With the increasing importance of performance modeling to both large-scale application optimization and application-system co-design, the conflict of speed and accuracy in contention modeling is becoming prominent. Cycle-accurate network simulators are often too slow for large scale applications, while point-to-point analytical models are not accurate enough to capture the contention effects. To model the network contention in batch point-to-point communications, we propose a unified contention model after the flow-fair end-to-end congestion control mechanism. The model uses packet-level simulations to be accurate, but can be approximated by a flow-level semi-analytical model when messages are large enough, and is thus fast. Furthermore, we propose a dynamic time-stepping technique which significantly speeds up the packet-level simulation with only minor accuracy loss. Experiments with typical communication patterns and application traces show that our model accurately predicts the communication time with an average error of 9% (fixed time step), and that the dynamic time-stepping technique improves the simulation performance by up to 131-fold with an average accuracy loss of 10.5% for real application traces.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115218870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Communication-aware Job Scheduling using SLURM","authors":"P. Mishra, Tushar Agrawal, Preeti Malakar","doi":"10.1145/3409390.3409410","DOIUrl":"https://doi.org/10.1145/3409390.3409410","url":null,"abstract":"Job schedulers play an important role in selecting optimal resources for the submitted jobs. However, most of the current job schedulers do not consider job-specific characteristics such as communication patterns during resource allocation. This often leads to sub-optimal node allocations. We propose three node allocation algorithms that consider the job’s communication behavior to improve the performance of communication-intensive jobs. We develop our algorithms for tree-based network topologies. The proposed algorithms aim at minimizing network contention by allocating nodes on the least contended switches. We also show that allocating nodes in powers of two leads to a decrease in inter-switch communication for MPI communications, which further improves performance. We implement and evaluate our algorithms using SLURM, a widely-used and well-known job scheduler. We show that the proposed algorithms can reduce the execution times of communication-intensive jobs by 9% (326 hours) on average. The average wait time of jobs is reduced by 31% across three supercomputer job logs.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129069036","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Preference Aware Smart Hospital Selection System for Patients","authors":"Md. Solaiman Chowdhury, Jenifar Rahman, Md. Mahfuzur Rahman","doi":"10.1145/3409390.3409391","DOIUrl":"https://doi.org/10.1145/3409390.3409391","url":null,"abstract":"With the rapid enhancement of wireless and mobile technologies, the context information of the user or environment can now easily be collected and analyzed to create useful services. The traditional healthcare facilities in most developing countries do not provide their medical services with equal quality. The patients face lots of difficulties in choosing the best-suited medical services or hospitals when they become sick. To make the proper decision for appropriate services, the patients need to consider many criteria that often create complexity. An efficient system is required to help the patients automatically accumulate the information necessary in making correct medical service selection. In this paper, we have proposed a preference-aware hospital selection model integrated into a cloud computing based context-aware system to satisfy the patients in selecting appropriate services. Through experimentation, we have shown that the developed system makes decisions accurately for the patients.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125589167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A GCC-based Compliance Checker for Single-translation-unit, Identifier-related MISRA-C Rules","authors":"Guan-Ren Wang, Peng-Sheng Chen","doi":"10.1145/3409390.3409396","DOIUrl":"https://doi.org/10.1145/3409390.3409396","url":null,"abstract":"MISRA-C is a well-defined software specification for the C programming language that gives programmers criteria to develop reliable programs. This paper implements a MISRA-C compliance checker based on the GCC compiler infrastructure. It focuses on identifier-related rules that are single-translation-unit-labeled. We describe and develop strategies for implementing the checking codes. We also discuss the rules that can be detected by existing GCC options. For the tested benchmark programs, the modified GCC compiler can correctly assess compliance with the target MISRA-C rules.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"2030 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129774815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Assessing the Overhead of Offloading Compression Tasks","authors":"L. Promberger, R. Schwemmer, H. Fröning","doi":"10.1145/3409390.3409405","DOIUrl":"https://doi.org/10.1145/3409390.3409405","url":null,"abstract":"Exploring compression is increasingly promising as a trade-off between computations and data movement. There are two main reasons: First, the gap between processing speed and I/O continues to grow, and technology trends indicate a continuation of this. Second, performance is determined by energy efficiency, and the overall power consumption is dominated by the consumption of data movements. For these reasons there is already a plethora of related works on compression from various domains. Most recently, a couple of accelerators have been introduced to offload compression tasks from the main processor, for instance by AHA, Intel and Microsoft. Yet, one lacks the understanding of the overhead of compression when offloading tasks. In particular, such offloading is most beneficial for overlap with other tasks, if the associated overhead on the main processor is negligible. This work evaluates the integration costs compared to a solely software-based solution considering multiple compression algorithms. Among others, High Energy Physics data are used as a prime example of big data sources. The results imply that on average the zlib implementation on the accelerator achieves a comparable compression ratio to zlib level 2 on a CPU, while having up to 17 times the throughput and utilizing over 80% less CPU resources. These results suggest that, given the right orchestration of compression and data movement tasks, the overhead of offloading compression is limited but present. Considering that compression is only a single task of a larger data processing pipeline, this overhead cannot be neglected.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"272 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122763264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Improving the Space-Time Efficiency of Matrix Multiplication Algorithms","authors":"Yuan Tang","doi":"10.1145/3409390.3409404","DOIUrl":"https://doi.org/10.1145/3409390.3409404","url":null,"abstract":"Classic cache-oblivious parallel matrix multiplication algorithms achieve optimality either in time or space, but not both, which promotes lots of research on the best possible balance or trade-off of such algorithms. We study modern processor-oblivious runtime systems and figure out several ways to improve algorithm’s time complexity while still bounding space and cache requirements to be asymptotically optimal. By our study, we give out sub-linear time, optimal work, space and caching algorithms for both general matrix multiplication on a semiring and Strassen-like fast algorithms on a ring. Our experiments show such algorithms have empirical advantages over classic counterparts. Our study provides new insights and research angles on how to optimize cache-oblivious parallel algorithms from both theoretical and empirical perspectives.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122846715","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Network and Load-Aware Resource Manager for MPI Programs","authors":"Ashish Kumar Kumar, N. Jain, Preeti Malakar","doi":"10.1145/3409390.3409406","DOIUrl":"https://doi.org/10.1145/3409390.3409406","url":null,"abstract":"We present a resource broker for MPI jobs in a shared cluster, considering the current compute load and available network bandwidths. MPI programs are generally communication-intensive. Thus the current network availability between the compute nodes impacts performance. Many existing resource allocation techniques mostly consider static node attributes and some dynamic resource attributes. This does not lead to a good allocation in case of shared clusters because the network usage and system load vary. We developed a load and network-aware heuristic for resource allocation. We incorporated the current network state in our heuristic. It is able to reduce execution times by more than 38% on average as compared to the default allocation.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122013226","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BSRNG: A High Throughput Parallel BitSliced Approach for Random Number Generators","authors":"Saleh Khalaj Monfared, Omid Hajihassani, M. Kiarostami, S. M. Zanjani, Dara Rahmati, S. Gorgin","doi":"10.1145/3409390.3409402","DOIUrl":"https://doi.org/10.1145/3409390.3409402","url":null,"abstract":"In this work, a high throughput method for generating high-quality Pseudo-Random Numbers using the bitslicing technique is proposed. In such a technique, instead of the conventional row-major data representation, column-major data representation is employed, which allows the bitslicing implementation to take full advantage of all the available datapath of the hardware platform. By employing this data representation as building blocks of algorithms, we showcase the capability and scalability of our proposed method in various PRNG methods in the category of block and stream ciphers. The LFSR-based (Linear Feedback Shift Register) nature of the PRNG in our implementation perfectly suits the GPU’s many-core structure due to its register oriented architecture. In the proposed SIMD vectorized GPU implementation, each GPU thread can generate several 32 pseudo-random bits in each LFSR clock cycle. We then compare our implementation with some of the most significant PRNGs that display a satisfactory performance throughput and randomness criteria. The proposed implementation successfully passes the NIST test for statistical randomness and bit-wise correlation criteria. For computer-based PRNG and the optical solutions in terms of performance and performance per cost, this technique is efficient while maintaining an acceptable randomness measure. Our highest performance among all of the implemented CPRNGs with the proposed method is achieved by the MICKEY 2.0 algorithm, which shows a 40% improvement over NVIDIA’s state-of-the-art proprietary high-performance PRNG library, cuRAND, achieving 2.72 Tb/s of throughput on the affordable NVIDIA GTX 2080 Ti.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127508806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Randomized Authentication using IBE for Opportunistic Networks","authors":"Kai Wang, Kazuya Sakai","doi":"10.1145/3409390.3409392","DOIUrl":"https://doi.org/10.1145/3409390.3409392","url":null,"abstract":"Opportunistic networks (ONs) are widely used in many critical network applications, and security/privacy issues in ONs are critical for their wide adoption. In this paper, we propose a randomized authentication protocol which consists of node registration and authentication phases using identity-based encryption (IBE) and a trust framework. The key ideas of our authentication protocol are to generate public keys from publicly available node IDs, and that not only the central registration server but also nodes with a high trust value can authenticate nodes in a network. By doing this, our protocol is lightweight and the authentication process is randomized in a distributed way. In addition, to accommodate the disadvantage of IBE, we introduce the idea of distributed KGCs (key generation centers) and the trust framework. The protocol level security of the proposed scheme is proven by indistinguishability-based provable security analysis using random oracles, and the qualitative security analyses for various attacks are conducted.","PeriodicalId":350506,"journal":{"name":"Workshop Proceedings of the 49th International Conference on Parallel Processing","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-08-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128326974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}