{"title":"A Practical, Scalable, Relaxed Priority Queue","authors":"Tingzhe Zhou, Maged M. Michael, Michael F. Spear","doi":"10.1145/3337821.3337911","DOIUrl":"https://doi.org/10.1145/3337821.3337911","url":null,"abstract":"Priority queues are a fundamental data structure, and in highly concurrent software, scalable priority queues are an important building block. However, they have a fundamental bottleneck when extracting elements, because of the strict requirement that each extract() returns the highest priority element. In many workloads, this requirement can be relaxed, improving scalability. We introduce ZMSQ, a scalable relaxed priority queue. It is the first relaxed priority queue that supports each of the following important practical features: (i) guaranteed success of extraction when the queue is nonempty, (ii) blocking of idle consumers, (iii) memory-safety in non-garbage-collected environments, and (iv) relaxation accuracy that does not degrade as the thread count increases. In addition, our experiments show that ZMSQ is competitive with state-of-the-art prior algorithms, often significantly outperforming them.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"58 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129877925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"AVR: Reducing Memory Traffic with Approximate Value Reconstruction","authors":"Albin Eldstål-Damlin, P. Trancoso, I. Sourdis","doi":"10.1145/3337821.3337824","DOIUrl":"https://doi.org/10.1145/3337821.3337824","url":null,"abstract":"This paper describes Approximate Value Reconstruction (AVR), an architecture for approximate memory compression. AVR reduces the memory traffic of applications that tolerate approximations in their dataset. Thereby, it utilizes more efficiently the available off-chip bandwidth improving significantly system performance and energy efficiency. AVR compresses memory blocks using low latency downsampling that exploits similarities between neighboring values and achieves aggressive compression ratios, up to 16:1 in our implementation. The proposed AVR architecture supports our compression scheme maximizing its effect and minimizing its overheads by (i) co-locating in the Last Level Cache (LLC) compressed and uncompressed data, (ii) efficiently handling LLC evictions, (iii) keeping track of badly compressed memory blocks, and (iv) avoiding LLC pollution with unwanted decompressed data. 
For applications that tolerate aggressive approximation in large fractions of their data, AVR reduces memory traffic by up to 70%, execution time by up to 55%, and energy costs by up to 20% introducing up to 1.2% error to the application output.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"30 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115929446","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Accelerating Long Read Alignment on Three Processors","authors":"Zonghao Feng, Shuang Qiu, Lipeng Wang, Qiong Luo","doi":"10.1145/3337821.3337918","DOIUrl":"https://doi.org/10.1145/3337821.3337918","url":null,"abstract":"Sequence alignment is a fundamental task in bioinformatics, because many downstream applications rely on it. The recent emergence of the third-generation sequencing technology requires new sequence alignment algorithms that handle longer read lengths as well as more sequencing errors. Furthermore, the rapidly increasing volume of sequence data calls for efficient analysis solutions. To address this need, we propose to utilize commodity parallel processors to perform the long read alignment. Specifically, we propose manymap, an acceleration of the leading CPU-based long read aligner minimap2 on the CPU, the GPU, and the Intel Xeon Phi processor. We eliminate intra-loop data dependency in the base-level alignment step of the original minimap2 through redesigning memory layouts of dynamic programming (DP) matrices. This change facilitates the effective vectorization of the most time-consuming procedure in alignment. Additionally, we apply architecture-aware optimizations, such as utilizing high bandwidth memory on Xeon Phi and concurrent kernel execution on GPU. We evaluate our manymap in comparison with the extended minimap2 on a Xeon Gold 5115 CPU, a Tesla V100 GPU, and a Xeon Phi 7210 processor. 
Our results show that manymap outperforms minimap2 by up to 2.3 times on the overall execution time and 4.5 times on the base-level alignment step.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126629135","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BPP: A Realtime Block Access Pattern Mining Scheme for I/O Prediction","authors":"Chunjie Zhu, F. Wang, Binbing Hou","doi":"10.1145/3337821.3337904","DOIUrl":"https://doi.org/10.1145/3337821.3337904","url":null,"abstract":"Block access patterns refer to the regularities of accessed blocks, and can be used to effectively enhance the intelligence of block storage systems. However, existing algorithms fail to uncover block access patterns in efficient ways. They either suffer high time and space overhead or only focus on the simplest patterns like sequential ones. In this paper, we propose a realtime block access pattern mining scheme, called BPP, to mine block access patterns at run time with low time and space overhead for making efficient I/O predictions. To reduce the time and space overhead for mining block access patterns, BPP classifies block access patterns into simple and compound ones based on the mining costs of different patterns, and differentiates the mining policies for simple and compound patterns. BPP also adopts a novel garbage cleaning policy, which is specially designed based on the observed features of the obtained patterns to accurately detect valueless patterns and remove them as early as possible. With such a garbage cleaning policy, BPP further reduces the space overhead for managing and utilizing the obtained patterns. To demonstrate the effect of BPP, we conduct a series of experiments with real-world workloads. 
The experimental results show that BPP can significantly outperform the state-of-the-art I/O prediction schemes.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127449031","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"QLEC","authors":"Ke Li, Haowei Huang, Xiaofeng Gao, Fan Wu, Guihai Chen","doi":"10.1145/3337821.3337926","DOIUrl":"https://doi.org/10.1145/3337821.3337926","url":null,"abstract":"With the emergence of Internet of Things (IoT), many battery-operated sensors are deployed in different applications to collect, process, and analyze useful information. In these applications, sensors are often grouped into different clusters to support higher scalability and better data aggregation. Clustering based on energy distribution among nodes can reduce energy consumption and prolong the network lifespan. In our paper, we propose a machine-learning-based energy-efficient clustering algorithm named QLEC to select cluster heads in high-dimensional space and help non-cluster-head nodes route packets. QLEC first selects cluster heads based on their residual energy through successive rounds. Besides, we prove the optimal cluster number in a high-dimensional wireless network and adopt it in our QLEC algorithm. Furthermore, Q-learning method is utilized to maximize residual energy of the network while routing packets from sensors to the base station (BS). The energy-efficient clustering problem in high dimensional space can be formed as an NP-Complete problem and QLEC is proved to solve it in the running time O(kX), where k is the cluster number and X is the number of updates Q-learning needs to converge. Extensive simulations and experiments based on a large-scale dataset show that the proposed scheme outperforms a newly proposed FCM-based algorithm and k-means clustering in terms of network lifespan, packet delivery rate, and transmission latency. 
To the best of our knowledge, this is the first work adopting the Q-learning method for clustering problems in high-dimensional space.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124022077","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Approximate Code: A Cost-Effective Erasure Coding Framework for Tiered Video Storage in Cloud Systems","authors":"Huayi Jin, Chentao Wu, Xin Xie, Jie Li, M. Guo, Hao Lin, Jianfeng Zhang","doi":"10.1145/3337821.3337869","DOIUrl":"https://doi.org/10.1145/3337821.3337869","url":null,"abstract":"Nowadays massive video data are stored in cloud storage systems, which are generated by various applications such as autonomous driving, news media, security monitoring, etc. Meanwhile, erasure coding is a popular technique in cloud storage to provide both high reliability and low monetary cost, where triple disk failure tolerant arrays (3DFTs) is a typical choice. Therefore, how to minimize the storage cost of video data in 3DFTs is a challenge for cloud storage systems. Although there are several solutions like approximate storage technique, they cannot guarantee low storage cost and high data reliability concurrently. To address this challenge, in this paper, we propose Approximate Code, which is an erasure coding framework for tiered video storage in cloud systems. The key idea of Approximate Code is distinguishing the important and unimportant data with different capabilities of fault tolerance. On one hand, for important data, Approximate Code provides triple parities to ensure high reliability. On the other hand, single/double parities are applied for unimportant data, which can save the storage cost and accelerate the recovery process. To demonstrate the effectiveness of Approximate Code, we conduct several experiments in Hadoop systems. 
The results show that, compared to traditional 3DFTs using various erasure codes such as RS, LRC, STAR and TIP-Code, Approximate Code reduces the number of parities by up to 55%, saves the storage cost by up to 20.8% and increases the recovery speed by up to 4.7X when double nodes fail.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129152799","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"BCL","authors":"Benjamin Brock, A. Buluç, K. Yelick","doi":"10.1145/3337821.3337912","DOIUrl":"https://doi.org/10.1145/3337821.3337912","url":null,"abstract":"One-sided communication is a useful paradigm for irregular parallel applications, but most one-sided programming environments, including MPI's one-sided interface and PGAS programming languages, lack application-level libraries to support these applications. We present the Berkeley Container Library, a set of generic, cross-platform, high-performance data structures for irregular applications, including queues, hash tables, Bloom filters and more. BCL is written in C++ using an internal DSL called the BCL Core that provides one-sided communication primitives such as remote get and remote put operations. The BCL Core has backends for MPI, OpenSHMEM, GASNet-EX, and UPC++, allowing BCL data structures to be used natively in programs written using any of these programming environments. Along with our internal DSL, we present the BCL ObjectContainer abstraction, which allows BCL data structures to transparently serialize complex data types while maintaining efficiency for primitive types. 
We also introduce the set of BCL data structures and evaluate their performance across a number of high-performance computing systems, demonstrating that BCL programs are competitive with hand-optimized code, even while hiding many of the underlying details of message aggregation, serialization, and synchronization.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116939037","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"swATOP","authors":"Wei Gao, Jiarui Fang, Wenlai Zhao, Jinzhe Yang, Long Wang, L. Gan, H. Fu, Guangwen Yang","doi":"10.1145/3337821.3337883","DOIUrl":"https://doi.org/10.1145/3337821.3337883","url":null,"abstract":"Achieving an optimized mapping of Deep Learning (DL) operators to new hardware architectures is the key to building a scalable DL system. However, handcrafted optimization involves huge engineering efforts, due to the variety of DL operator implementations and complex programming skills. Targeting the innovative many-core processor SW26010 adopted by the 3rd fastest supercomputer Sunway TaihuLight, an end-to-end automated framework called swATOP is presented as a more practical solution for DL operator optimization. Arithmetic intensive DL operators are expressed into an auto-tuning-friendly form, which is based on tensorized primitives. By describing the algorithm of a DL operator using our domain specific language (DSL), swATOP is able to derive and produce an optimal implementation by separating hardware-dependent optimization and hardware-agnostic optimization. Hardware-dependent optimization is encapsulated in a set of tensorized primitives with sufficient utilization of the underlying hardware features. The hardware-agnostic optimization contains a scheduler, an intermediate representation (IR) optimizer, an auto-tuner, and a code generator. These modules cooperate to perform an automatic design space exploration, to apply a set of programming techniques, to discover a near-optimal solution, and to generate the executable code. Our experiments show that swATOP is able to bring significant performance improvement on DL operators in over 88% of cases, compared with the best-handcrafted optimization. 
Compared to a black-box autotuner, the tuning and code generation time can be reduced to minutes from days using swATOP.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115403752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Specialized Concurrent Queue for Scheduling Irregular Workloads on GPUs","authors":"David Troendle, T. Ta, B. Jang","doi":"10.1145/3337821.3337837","DOIUrl":"https://doi.org/10.1145/3337821.3337837","url":null,"abstract":"The persistent thread model offers a viable solution for accelerating data-irregular workloads on Graphic Processing Units (GPUs). However, as the number of active threads increases, contention and retries on shared resources limit the efficiency of task scheduling among the persistent threads. To address this, we propose a highly scalable, non-blocking concurrent queue suitable for use as a GPU persistent thread task scheduler. The proposed concurrent queue has two novel properties: 1) The supporting enqueue/dequeue queue operations never suffer from retry overhead because the atomic operation does not fail and the queue empty exception has been refactored; and 2) The queue operates on an arbitrary number of queue entries for the same cost as a single entry. A proxy thread in each thread group performs all atomic operations on behalf of all threads in the group. These two novel properties substantially reduce thread contention caused by the GPU's lock-step Single Instruction Multiple Threads (SIMT) execution model. To demonstrate the performance and scalability of the proposed queue, we implemented a top-down Breadth First Search (BFS) based on the persistent thread model using 1) the proposed concurrent queue, and 2) two traditional concurrent queues; and analyzed its performance and scalability characteristics under different input graph datasets and hardware configurations. Our experiments show that the BFS implementation based on our proposed queue outperforms not only ones based on traditional queues but also the state-of-the-art BFS implementations found in the literature by a minimum of 1.26× and maximum of 36.23×. 
We also observed the scalability of our proposed queue is within 10% of the ideal linear speedup for up to the maximum number of threads supported by high-end discrete GPUs (14K threads in our experiment).","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114216579","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Distributed Join Algorithms on Multi-CPU Clusters with GPUDirect RDMA","authors":"Chengxin Guo, Hong Chen, Feng Zhang, Cuiping Li","doi":"10.1145/3337821.3337862","DOIUrl":"https://doi.org/10.1145/3337821.3337862","url":null,"abstract":"In data management systems, query processing on GPUs or distributed clusters has proven to be an effective method for high efficiency. However, the high PCIe data transfer overhead between CPUs and GPUs, and the communication cost between nodes in distributed systems are usually the bottleneck for improving system performance. Recently, GPUDirect RDMA has been developed and has received a lot of attention. It contains the features of the RDMA and GPUDirect technologies, which provides new opportunities for optimizing query processing. In this paper, we revisit the join algorithm, one of the most important operators in query processing, with GPUDirect RDMA. Specifically, we explore the performance of the hash join and sort merge join with GPUDirect RDMA. We present a new design using GPUDirect RDMA to improve the data communication in distributed join algorithms on multi-GPU clusters. We propose a series of techniques, including multi-layer data partitioning, and adaptive data communication path selection for various transmission channels. Experiments show that the proposed distributed join algorithms using GPUDirect RDMA achieve up to 1.83x performance speedup compared to the state-of-the-art distributed join algorithms. To the best of our knowledge, this is the first work on distributed GPU join algorithms. 
We believe that the insights and implications in this study will shed light on future research using GPUDirect RDMA.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"165 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123265021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}