Proceedings of the 48th International Conference on Parallel Processing: Latest Papers (Page 10)

Building Scalable NVM-based B+tree with HTM

Proceedings of the 48th International Conference on Parallel Processing Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337827

Mengxing Liu, Jiankai Xing, Kang Chen, Yongwei Wu

Abstract: Emerging non-volatile memory (NVM) opens an opportunity to build durable data structures. However, building a highly efficient complex data structure such as a B+tree on NVM is not easy. We investigate the essential performance bottleneck for NVM-based B+trees. Even with a single-core CPU, performance is limited by the atomic-write size, which plays an essential role in the trade-off between persistence overhead and keeping leaf node entries sorted. In the multi-core setting, the overlap of concurrency and persistence is key to system scalability. Based on this analysis, we propose RNTree, a durable NVM-based B+tree that uses hardware transactional memory (HTM). Our way of using HTM addresses both problems simultaneously. (1) HTM can use cache-line granularity to provide a larger atomic-write size. Based on this, we propose a new slot-array approach that traces the order of entries in the leaf nodes while still reducing the number of persistent instructions. (2) With careful design, RNTree moves slow persistent instructions out of critical sections and introduces a dual slot array design to extract more concurrency. For a single thread, RNTree achieves 1.44×/4.2× higher throughput for single-key operations and range queries, respectively. For multiple threads, the throughput of RNTree is 2.3× higher than that of state-of-the-art works.
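The slot-array idea mentioned in the abstract above can be illustrated with a small in-memory sketch. This is not RNTree's implementation: it models neither persistence nor HTM, and the class and method names (LeafNode, insert, get, range_scan) are hypothetical. It only shows the general technique of appending entries unsorted while a small index array records their sorted order, so that an insert never shifts existing entries.

```python
import bisect


class LeafNode:
    """Toy leaf node: entries are appended unsorted; a slot array keeps key order.

    Hypothetical illustration only; no persistence or HTM is modelled.
    """

    def __init__(self, capacity=8):
        self.capacity = capacity
        self.keys = []    # append-only log of keys, in insertion order
        self.values = []  # values, parallel to self.keys
        self.slots = []   # slots[i] = index in self.keys of the i-th smallest key

    def _sorted_keys(self):
        return [self.keys[i] for i in self.slots]

    def insert(self, key, value):
        if len(self.keys) >= self.capacity:
            raise OverflowError("leaf full; a real tree would split here")
        # Step 1: append the entry; existing entries are never moved.
        self.keys.append(key)
        self.values.append(value)
        # Step 2: update only the small slot array to restore logical order.
        pos = bisect.bisect_left(self._sorted_keys(), key)
        self.slots.insert(pos, len(self.keys) - 1)

    def get(self, key):
        sorted_keys = self._sorted_keys()
        pos = bisect.bisect_left(sorted_keys, key)
        if pos < len(sorted_keys) and sorted_keys[pos] == key:
            return self.values[self.slots[pos]]
        return None

    def range_scan(self, lo, hi):
        """Return (key, value) pairs with lo <= key <= hi, in key order."""
        return [(self.keys[i], self.values[i])
                for i in self.slots if lo <= self.keys[i] <= hi]
```

In a design like the one the abstract describes, the slot array is presumably the small structure that must be updated atomically, which is why a larger atomic-write size matters; the sketch leaves that aspect out entirely.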

Citations: 15

Tessellating Star Stencils

Proceedings of the 48th International Conference on Parallel Processing Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337835

Liang Yuan, Shan Huang, Yunquan Zhang, Hang Cao

Citations: 8

N-Code: An Optimal RAID-6 MDS Array Code for Load Balancing and High I/O Performance

Proceedings of the 48th International Conference on Parallel Processing Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337829

Ping Xie, Zhu Yuan, Jianzhong Huang, X. Qin

Abstract: Existing RAID-6 codes are developed to optimize either reads or writes for storage systems. To improve both read and write operations, this paper proposes a novel RAID-6 MDS array code called N-Code. N-Code exhibits three salient features: (i) read performance. N-Code assigns both horizontal parity chains and horizontal parities across disks, without generating a dedicated parity disk. Such a parity layout not only makes all the disks service normal reads, but also allows continuous data elements to share the same horizontal chain to optimize degraded reads; (ii) write performance. Diagonal parities are distributed across disks in a decentralized manner to optimize partial stripe writes, and horizontal parity chains enable N-Code to reduce the I/O costs of partial stripe writes by merging I/O operations; and (iii) balancing performance. Decentralized horizontal/diagonal parities potentially support I/O balancing optimization for single writes. A theoretical analysis indicates that, apart from optimal storage efficiency, N-Code features optimal complexity for both encoding/decoding computations and update operations. The results of empirical experiments show that N-Code demonstrates higher normal-read, degraded-read, and partial-stripe-write performance than seven popular baseline RAID-6 codes. In particular, in the partial-stripe-write case, N-Code accelerates partial stripe writes by 32%-66% relative to horizontal codes; when it comes to degraded reads, N-Code improves degraded reads by 32%-53% compared to vertical codes. Furthermore, compared to the baseline codes, N-Code enhances load balancing by a factor of between 1.19 and 9.09 for single-write workloads, and between 1.3 and 6.92 for read-write mixed workloads.
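The degraded reads mentioned in the abstract above rely on rebuilding a missing element from the surviving members of a parity chain. The sketch below shows only the generic XOR-parity mechanism for a single chain; it does not reproduce N-Code's actual layout of horizontal and diagonal parities, and the function names xor_blocks and recover_missing are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equally sized blocks (the parity of one chain)."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def recover_missing(surviving_blocks, parity):
    """Rebuild the single missing data block of a chain (a degraded read).

    XOR-ing the parity with every surviving block cancels them out and
    leaves the missing block.
    """
    return xor_blocks(list(surviving_blocks) + [parity])


# Usage: three data blocks in one chain, then the middle one is lost.
data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = xor_blocks(data)
rebuilt = recover_missing([data[0], data[2]], parity)
```

A RAID-6 code needs a second, independent parity family to survive two failures; that part is specific to each code and is not modelled here.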

Citations: 8

Network Congestion Avoidance through Packet-chaining Reservation

Proceedings of the 48th International Conference on Parallel Processing Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337874

Ketong Wu, Dezun Dong, Cunlu Li, Shan Huang, Yi Dai

Abstract: Endpoint congestion is a bottleneck in high-performance computing (HPC) networks and severely impacts system performance, especially for latency-sensitive applications. For long messages (or flows) whose duration is far larger than the round-trip time (RTT), endpoint congestion can be effectively mitigated by proactive or reactive countermeasures such that the injection rate of each source is dynamically controlled to a proper level. However, many HPC applications produce hybrid traffic, a mix of short and long messages, and are dominated by short messages. Existing proactive congestion avoidance methods face the great challenge of scheduling the rapidly changing traffic pattern caused by these short messages. In this paper, we leverage the advantages of proactive and reactive congestion avoidance techniques and propose the Packet-chaining Reservation Protocol (PCRP) to strike a dynamic balance between flows following proactive scheduling and packets subject to reactive network conditions. We select chained packets as a flexible reservation granularity between the whole flow and one packet. We allow small flows to be speculatively transmitted without being discarded and give them higher priority over the entire network. PCRP can respond quickly to network conditions, effectively avoid the formation of endpoint congestion, and reduce the average flow delay. We conduct extensive experiments to evaluate PCRP and compare it with the state-of-the-art proactive reservation-based protocols, Speculative Reservation Protocol (SRP) and Bilateral Flow Reservation Protocol (BFRP). The simulation results demonstrate that in our design the flow latency can be reduced by 50.2% for hotspot traffic and 28.38% for uniform traffic.

Citations: 13

Nested Virtualization Without the Nest

Proceedings of the 48th International Conference on Parallel Processing Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337840

Mathieu Bacou, Grégoire Todeschi, A. Tchana, D. Hagimont

Citations: 4

Near-Data Processing-Enabled and Time-Aware Compaction Optimization for LSM-tree-based Key-Value Stores

Proceedings of the 48th International Conference on Parallel Processing Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337855

Hui Sun, Wei Liu, Jianzhong Huang, Song Fu, Zhi Qiao, Weisong Shi

Abstract: With the growing volume of storage systems, traditional relational databases cannot reach the high performance required by big-data applications. As high-throughput alternatives to relational databases, LSM-tree-based key-value stores (KV stores for short) are confronted with degraded write performance during compaction under update-intensive workloads. To address this issue, we design and implement a time-aware compaction optimization framework for KV stores called TStore. TStore explores the near-data processing (NDP) model. It dynamically partitions compaction tasks between the host and an NDP-enabled device to minimize the total compaction time. The partitioned compaction tasks are conducted by the host and the device in parallel. NDP-based devices exhibit low-latency, high-performance, and high-bandwidth capability, thus facilitating key-value stores. TStore can not only accomplish compaction for KV stores, but also improve overall performance by removing the bottleneck in compaction. Results show that TStore with an NDP framework can achieve 3.8x and 1.9x performance improvement over LevelDB and Co-KV under the db_bench workload. In addition, the TStore-enabled KV store outperforms LevelDB and Co-KV by a factor of 3.6x and 1.9x in throughput and by 72.0% and 48.9% in latency, respectively, under realistic workloads generated by YCSB.

Citations: 3

Lightweight Fault Tolerance in Pregel-Like Systems

Proceedings of the 48th International Conference on Parallel Processing Pub Date : 2019-08-05 DOI: 10.1145/3337821.3337823

Da Yan, James Cheng, Hongzhi Chen, Cheng Long, P. Bangalore

Citations: 6

How to Make the Preconditioned Conjugate Gradient Method Resilient Against Multiple Node Failures

Proceedings of the 48th International Conference on Parallel Processing Pub Date : 2019-07-30 DOI: 10.1145/3337821.3337849

C. Pachajoa, M. Levonyak, W. Gansterer, J. Träff

Abstract: We study algorithmic approaches for recovering from the failure of several compute nodes in the parallel preconditioned conjugate gradient (PCG) solver on large-scale parallel computers. In particular, we analyze and extend an exact state reconstruction (ESR) approach, which is based on a method proposed by Chen (2011). In the ESR approach, the solver keeps redundant information from previous search directions, so that the solver state can be fully reconstructed if a node fails unexpectedly. ESR does not require checkpointing or external storage for saving dynamic solver data and has low overhead compared to the failure-free situation. In this paper, we improve the fault tolerance of the PCG algorithm based on the ESR approach. In particular, we support recovery from simultaneous or overlapping failures of several nodes for general sparsity patterns of the system matrix, which cannot be handled by Chen's method. For this purpose, we refine the strategy for how to store redundant information across nodes. We analyze and implement our new method and perform numerical experiments with large sparse matrices from real-world applications on 128 nodes of the Vienna Scientific Cluster (VSC). For recovering from three simultaneous node failures we observe average runtime overheads between only 2.8% and 55.0%. The overhead of the improved resilience depends on the sparsity pattern of the system matrix.
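For readers unfamiliar with the solver the abstract above builds on, here is a minimal serial sketch of the standard preconditioned conjugate gradient iteration. It is not the paper's resilient variant and contains no ESR logic; the Jacobi (diagonal) preconditioner, the dense list-of-lists matrix, and the function name pcg are assumptions chosen to keep the example self-contained.

```python
def pcg(A, b, tol=1e-10, max_iter=1000):
    """Solve A x = b for a symmetric positive definite matrix A.

    Standard PCG with a Jacobi preconditioner (an assumption for this sketch).
    A is a dense list of rows; b is a list of floats.
    """
    n = len(b)

    def matvec(v):
        return [sum(A[i][j] * v[j] for j in range(n)) for i in range(n)]

    def dot(u, v):
        return sum(ui * vi for ui, vi in zip(u, v))

    inv_diag = [1.0 / A[i][i] for i in range(n)]  # Jacobi preconditioner M^-1

    x = [0.0] * n
    r = list(b)                                   # residual b - A x for x = 0
    z = [d * ri for d, ri in zip(inv_diag, r)]    # preconditioned residual
    p = list(z)                                   # first search direction
    rz = dot(r, z)
    for _ in range(max_iter):
        if dot(r, r) ** 0.5 < tol:                # converged
            break
        Ap = matvec(p)
        alpha = rz / dot(p, Ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, Ap)]
        z = [d * ri for d, ri in zip(inv_diag, r)]
        rz_new = dot(r, z)
        beta = rz_new / rz
        p = [zi + beta * pi for zi, pi in zip(z, p)]  # next search direction
        rz = rz_new
    return x


# Usage: a 2x2 SPD system whose exact solution is x = [1/11, 7/11].
solution = pcg([[4.0, 1.0], [1.0, 3.0]], [1.0, 2.0])
```

In a distributed run, the vectors x, r, z, and p are partitioned across nodes, so losing a node loses part of each; this is presumably the dynamic solver state that the abstract says must be reconstructed.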

Citations: 9

A 2D Parallel Triangle Counting Algorithm for Distributed-Memory Architectures

Proceedings of the 48th International Conference on Parallel Processing Pub Date : 2019-07-22 DOI: 10.1145/3337821.3337853

A. Tom, G. Karypis

Citations: 5

Automatic Differentiation for Adjoint Stencil Loops

Proceedings of the 48th International Conference on Parallel Processing Pub Date : 2019-07-05 DOI: 10.1145/3337821.3337906

J. Hückelheim, Navjot Kukreja, S. Narayanan, F. Luporini, G. Gorman, P. Hovland

Abstract: Stencil loops are a common motif in computations including convolutional neural networks, structured-mesh solvers for partial differential equations, and image processing. Stencil loops are easy to parallelise, and their fast execution is aided by compilers, libraries, and domain-specific languages. Reverse-mode automatic differentiation, also known as algorithmic differentiation, autodiff, adjoint differentiation, or back-propagation, is sometimes used to obtain gradients of programs that contain stencil loops. Unfortunately, conventional automatic differentiation results in a memory access pattern that is not stencil-like and not easily parallelisable. In this paper we present a novel combination of automatic differentiation and loop transformations that preserves the structure and memory access pattern of stencil loops, while computing fully consistent derivatives. The generated loops can be parallelised and optimised for performance in the same way and using the same tools as the original computation. We have implemented this new technique in the Python tool PerforAD, which we release with this paper along with test cases derived from seismic imaging and computational fluid dynamics applications.
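The contrast drawn in the abstract above, between the scatter-style adjoint produced by conventional reverse-mode differentiation and a stencil-like gather form, can be sketched for a one-dimensional three-point stencil. This is a hand-written illustration, not output of PerforAD; the function names and the coefficients a, b, c are hypothetical, and boundaries are handled with a simple guard rather than separate remainder loops.

```python
def stencil(u, a, b, c):
    """Primal loop: out[i] = a*u[i-1] + b*u[i] + c*u[i+1] on the interior."""
    n = len(u)
    out = [0.0] * n
    for i in range(1, n - 1):
        out[i] = a * u[i - 1] + b * u[i] + c * u[i + 1]
    return out


def adjoint_scatter(out_bar, a, b, c):
    """Conventional reverse-mode adjoint: each iteration writes three entries.

    Neighbouring iterations update the same u_bar entries, so the loop is
    not safe to parallelise without atomics or colouring.
    """
    n = len(out_bar)
    u_bar = [0.0] * n
    for i in range(1, n - 1):
        u_bar[i - 1] += a * out_bar[i]
        u_bar[i] += b * out_bar[i]
        u_bar[i + 1] += c * out_bar[i]
    return u_bar


def adjoint_gather(out_bar, a, b, c):
    """Equivalent gather form: each iteration writes exactly one entry.

    This has the same read-neighbours, write-one access pattern as the
    primal stencil, so iterations are independent.
    """
    n = len(out_bar)

    def ob(i):
        # out_bar only contributes on the primal interior 1..n-2.
        return out_bar[i] if 1 <= i <= n - 2 else 0.0

    return [c * ob(j - 1) + b * ob(j) + a * ob(j + 1) for j in range(n)]
```

Both adjoints compute the same values; only the loop structure differs. Note that the coefficients swap sides in the gather form: a multiplies the right neighbour and c the left one.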

Citations: 14