{"title":"Building Scalable NVM-based B+tree with HTM","authors":"Mengxing Liu, Jiankai Xing, Kang Chen, Yongwei Wu","doi":"10.1145/3337821.3337827","DOIUrl":"https://doi.org/10.1145/3337821.3337827","url":null,"abstract":"Emerging on-volatile memory (NVM) opens an opportunity to build durable data structures. However, to build a highly efficient complex data structure like B+tree on NVM is not easy. We investigate the essential performance bottleneck for NVM-based B+tree. Even with a single-core CPU, the performance is limited by the atomic-write size which plays an essential role in the trade-off between the persistent overhead and keeping leaf node entries sorted. For the multi-core setting, the overlapping of concurrency and persistency is key to the system scalability. Based on the analysis, we propose RNTree, a durable NVM-based B+tree using the hardware transactional memory (HTM). Our way of using HTM can actually address both problems mentioned above simultaneously. (1) HTM can use cache-line granularity to provide larger atomic-write size. Based on this, we propose a new slot-array approach which traces the order of entries in the leaf nodes while still reducing the number of persistent instructions. (2) With careful design, RNTree moves slow persistent instructions out of critical sections and proposes the dual slot array design, to extract more concurrency. For single thread, RNTree achieves 1.44×/4.2× higher throughput for single-key operations and range queries respectively. For multiple threads, the throughput of RNTree is 2.3× higher than state-of-the-art works.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122023692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tessellating Star Stencils","authors":"Liang Yuan, Shan Huang, Yunquan Zhang, Hang Cao","doi":"10.1145/3337821.3337835","DOIUrl":"https://doi.org/10.1145/3337821.3337835","url":null,"abstract":"Stencil computations represent a very common class of nested loops in scientific and engineering applications. The exhaustively studied tiling is one of the most powerful transformation techniques to explore the data locality and parallelism. Existing work often uniformly handles different stencil shapes. This paper first presents a concept called natural block to identify the difference between the star and box stencils. Then we propose anew two-level tessellation scheme for star stencils, where the natural block, as well as its successors can tessellate the spatial space and their extensions along the time dimension are able to form a tessellation of the iteration space. Furthermore, a novel implementation technique called double updating is developed for star stencils specifically to improve the in-core data reuse pattern. Evaluation results are provided that demonstrate the effectiveness of the approach.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132258162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"N-Code: An Optimal RAID-6 MDS Array Code for Load Balancing and High I/O Performance","authors":"Ping Xie, Zhu Yuan, Jianzhong Huang, X. Qin","doi":"10.1145/3337821.3337829","DOIUrl":"https://doi.org/10.1145/3337821.3337829","url":null,"abstract":"Existing RAID-6 codes are developed to optimize either reads or writes for storage systems. To improve both read and write operations, this paper proposes a novel RAID-6 MDS array code called N-Code. N-Code exhibits three aspects of salient features: (i) read performance. N-Code assigns both horizontal parity chains and horizontal parities across disks, without generating a dedicated parity disk. Such a parity layout not only makes all the disks service normal reads, but also allows continuous data elements to share the same horizontal chain to optimize degraded reads; (ii) write performance. Diagonal parities are distributed across disks in a decentralized manner to optimize partial stripe writes, and horizontal parity chains enable N-Code to reduce I/O costs of partial stripe writes by merging I/O operations; and (iii) balancing performance. Decentralized horizontal/diagonal parities potentially support the I/O balancing optimization for single writes. A theoretical analysis indicates that apart from the optimal storage efficiency, N-Code is featured with the optimal complexity for both encoding/decoding computations and update operations. The results of empirical experiments shows that N-Code demonstrates higher normal-read, degraded-read, and partial-stripe-write performance than the seven baseline popular RAID-6 codes. In particular, in the partial-stripe-write case, N-Code accelerates partial stripe writes by 32%-66% relative to horizontal codes; when it comes to degraded reads, N-Code improves degraded reads by 32%-53% compared to vertical codes. Furthermore, compared to the baseline codes, N-Code enhances load balancing by a factor anywhere between 1.19 to 9.09 for single-write workload, and between 1.3 to 6.92 for read-write mixed workload.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"211 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132233985","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ketong Wu, Dezun Dong, Cunlu Li, Shan Huang, Yi Dai
{"title":"Network Congestion Avoidance through Packet-chaining Reservation","authors":"Ketong Wu, Dezun Dong, Cunlu Li, Shan Huang, Yi Dai","doi":"10.1145/3337821.3337874","DOIUrl":"https://doi.org/10.1145/3337821.3337874","url":null,"abstract":"Endpoint congestion is a bottleneck in high-performance computing (HPC) networks and severely impacts system performance, especially for latency-sensitive applications. For long messages (or flows) whose duration is far larger than the round-trip time (RTT), endpoint congestion can be effectively mitigated by proactive or reactive counter-measures such that the injection rate of each source is dynamically controlled to a proper level. However, many HPC applications produce a hybrid traffic, a mix of short and long messages, and are dominated by short messages. Existing proactive congestion avoidance methods face the great challenge of scheduling the rapidly changing traffic pattern caused by these short messages. In this paper, we leverage the advantages of proactive and reactive congestion avoidance techniques and propose the Packet-chaining Reservation Protocol (PCRP) to make a dynamic balance between flows following proactive scheduling and packets subjected to reactive network conditions. We select the chaining packets as a flexible reservation granularity between the whole flow and one packet. We allow small flows to be speculatively transmitted without being discarded and give them higher priority over the entire network. Our PCRP can respond quickly to network conditions and effectively avoid the formation of endpoint congestion and reduce the average flow delay. We conduct extensive experiments to evaluate our PCRP and compare it with the state-of-the-art proactive reservation-based protocols, Speculative Reservation Protocol (SRP) and Bilateral Flow Reservation Protocol (BFRP). The simulation results demonstrate that in our design the flow latency can be reduced by 50.2% for hotspot traffic and 28.38% for uniform traffic.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"71 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130851582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mathieu Bacou, Grégoire Todeschi, A. Tchana, D. Hagimont
{"title":"Nested Virtualization Without the Nest","authors":"Mathieu Bacou, Grégoire Todeschi, A. Tchana, D. Hagimont","doi":"10.1145/3337821.3337840","DOIUrl":"https://doi.org/10.1145/3337821.3337840","url":null,"abstract":"With the increasing popularity of containers, managing them on top of virtual machines becomes a common practice, called nested virtualization. This paper presents BrFusion and Hostlo, two solutions that address each of two networking issues of nested virtualization: network virtualization duplication and virtual machine-bounded pod deployments. The first issue lengthens network packet paths while the second issue leads to resource fragmentation. For instance, in respect with the first issue, we measured a throughput degradation of about 68% and a latency increase of about 31% in comparison with a single networking layer. We prototype BrFusion and Hostlo in Linux KVM/QEMU, Docker and Kubernetes systems. The evaluation results show that BrFusion leads to the same performance as a single-layer virtualization deployment. Concerning Hostlo, the results show that more than 11% of cloud clients see their cloud utilization cost reduced by down to 40%.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116688964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hui Sun, Wei Liu, Jianzhong Huang, Song Fu, Zhi Qiao, Weisong Shi
{"title":"Near-Data Processing-Enabled and Time-Aware Compaction Optimization for LSM-tree-based Key-Value Stores","authors":"Hui Sun, Wei Liu, Jianzhong Huang, Song Fu, Zhi Qiao, Weisong Shi","doi":"10.1145/3337821.3337855","DOIUrl":"https://doi.org/10.1145/3337821.3337855","url":null,"abstract":"With the growing volume of storage systems, the traditional relational databases cannot reach the high performance required by big-data applications. As high-throughput alternatives to relational databases, LSM-tree-based key-value stores (KV stores in short) are confronted with degraded write performance during compaction under update-intensive workloads. To address this issue, we design and implement a time-aware compaction optimization framework for KV stores called TStore. TStore explores the near-data processing (i.e., NDP) model. It dynamically partitions compaction tasks into both host and NDP-enabled device to minimize the total time of compaction. The partitioned compaction tasks are conducted by the host and the device in parallel. The NDP-based devices exhibit low-latency, high-performance and high-bandwidth capability, thus facilitating key-value stores. TStore can not only accomplish compaction for KV stores, but also improve overall performance by removing bottleneck in compaction. Results show that the TStore with an NDP framework can achieve 3.8x and 1.9x performance improvement over LevelDB and Co-KV under the db_bench workload. In addition, the TStore-enabled KV store outperforms LevelDB and Co-KV by a factor of 3.6x and 1.9x in throughput and 72.0% and 48.9% in latency, respectively, under realistic workloads generated by YCSB.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127148456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Da Yan, James Cheng, Hongzhi Chen, Cheng Long, P. Bangalore
{"title":"Lightweight Fault Tolerance in Pregel-Like Systems","authors":"Da Yan, James Cheng, Hongzhi Chen, Cheng Long, P. Bangalore","doi":"10.1145/3337821.3337823","DOIUrl":"https://doi.org/10.1145/3337821.3337823","url":null,"abstract":"Pregel-like systems are popular for iterative graph processing thanks to their user-friendly vertex-centric programming model. However, existing Pregel-like systems only adopt a naïve checkpointing approach for fault tolerance, which saves a large amount of data about the state of computation and significantly degrades the failure-free execution performance. Advanced fault tolerance/recovery techniques are left unexplored in the context of Pregel-like systems. This paper proposes a non-invasive lightweight checkpointing (LWCP) scheme which minimizes the data saved to each checkpoint, and additional data required for recovery are generated online from the saved data. This improvement results in 10x speedup in checkpointing, and an integration of it with a recently proposed log-based recovery approach can further speed up recovery when failure occurs. Extensive experiments verified that our proposed LWCP techniques are able to significantly improve the performance of both checkpointing and recovery in a Pregel-like system.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121782264","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures","authors":"C. Pachajoa, M. Levonyak, W. Gansterer, J. Träff","doi":"10.1145/3337821.3337849","DOIUrl":"https://doi.org/10.1145/3337821.3337849","url":null,"abstract":"We study algorithmic approaches for recovering from the failure of several compute nodes in the parallel preconditioned conjugate gradient (PCG) solver on large-scale parallel computers. In particular, we analyze and extend an exact state reconstruction (ESR) approach, which is based on a method proposed by Chen (2011). In the ESR approach, the solver keeps redundant information from previous search directions, so that the solver state can be fully reconstructed if a node fails unexpectedly. ESR does not require checkpointing or external storage for saving dynamic solver data and has low overhead compared to the failure-free situation. In this paper, we improve the fault tolerance of the PCG algorithm based on the ESR approach. In particular, we support recovery from simultaneous or overlapping failures of several nodes for general sparsity patterns of the system matrix, which cannot be handled by Chen's method. For this purpose, we refine the strategy for how to store redundant information across nodes. We analyze and implement our new method and perform numerical experiments with large sparse matrices from real-world applications on 128 nodes of the Vienna Scientific Cluster (VSC). For recovering from three simultaneous node failures we observe average runtime overheads between only 2.8% and 55.0%. The overhead of the improved resilience depends on the sparsity pattern of the system matrix.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129966451","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A 2D Parallel Triangle Counting Algorithm for Distributed-Memory Architectures","authors":"A. Tom, G. Karypis","doi":"10.1145/3337821.3337853","DOIUrl":"https://doi.org/10.1145/3337821.3337853","url":null,"abstract":"Triangle counting is a fundamental graph analytic operation that is used extensively in network science and graph mining. As the size of the graphs that needs to be analyzed continues to grow, there is a requirement in developing scalable algorithms for distributed-memory parallel systems. To this end, we present a distributed-memory triangle counting algorithm, which uses a 2D cyclic decomposition to balance the computations and reduce the communication overheads. The algorithm structures its communication and computational steps such that it reduces its memory overhead and includes key optimizations that leverage the sparsity of the graph and the way the computations are structured. Experiments on synthetic and real-world graphs show that our algorithm obtains an average relative speedup range between 3.24 to 7.22 out of 10.56 across the datasets using 169 MPI ranks over the performance achieved by 16 MPI ranks. Moreover, we obtain an average speedup of 10.2 times on comparison with previously developed distributed-memory parallel algorithms.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122544308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Hückelheim, Navjot Kukreja, S. Narayanan, F. Luporini, G. Gorman, P. Hovland
{"title":"Automatic Differentiation for Adjoint Stencil Loops","authors":"J. Hückelheim, Navjot Kukreja, S. Narayanan, F. Luporini, G. Gorman, P. Hovland","doi":"10.1145/3337821.3337906","DOIUrl":"https://doi.org/10.1145/3337821.3337906","url":null,"abstract":"Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoint differentiation, or back-propagation, is sometimes used to obtain gradients of programs that contain stencil loops. Unfortunately, conventional automatic differentiation results in a memory access pattern that is not stencil-like and not easily parallelisable. In this paper we present a novel combination of automatic differentiation and loop transformations that preserves the structure and memory access pattern of stencil loops, while computing fully consistent derivatives. The generated loops can be parallelised and optimised for performance in the same way and using the same tools as the original computation. We have implemented this new technique in the Python tool PerforAD, which we release with this paper along with test cases derived from seismic imaging and computational fluid dynamics applications.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-07-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134470738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}