Proceedings of the 51st International Conference on Parallel Processing: Latest Publications

DRAM Cache Management with Request Granularity for NAND-based SSDs
Proceedings of the 51st International Conference on Parallel Processing · Pub Date: 2022-08-29 · DOI: 10.1145/3545008.3545081
Haodong Lin, Zhibing Sha, Jun Li, Zhigang Cai, Balazs Gerofi, Yuanquan Shi, Jianwei Liao
Abstract: Most flash-based solid-state drives (SSDs) employ an on-board Dynamic Random Access Memory (DRAM) to cache hot data at the SSD page granularity. This can significantly reduce the number of flush operations to the underlying flash arrays, given sufficient locality in the applications' I/O access patterns. We observe, however, that in most I/O workloads over SSDs the buffered data of small requests are more likely to be re-accessed than those of larger requests, which also require more DRAM space to cache. To improve the efficiency of the DRAM cache inside SSDs, this paper presents a request granularity-based cache management scheme called Req-block. The proposed mechanism manages cached data according to the size of write requests and uses multi-level linked lists to sift the cached data blocks (termed request blocks), taking both their size and hotness into account. Comprehensive evaluation shows that our proposal improves cache hits by up to 90.5% and decreases I/O latency by 14.3% on average, compared to existing state-of-the-art SSD cache management schemes.
Citations: 2
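The abstract's central policy, cache small and hot, evict large and cold, can be illustrated with a toy multi-level cache. The sketch below (plain Python; the class, method names, and size classes are all invented) keeps one LRU list per request-size class and evicts from the largest, coldest class first; it is a schematic of the idea, not the paper's Req-block implementation.

```python
from collections import OrderedDict

class RequestBlockCache:
    """Toy request-granularity cache: evicts large, cold request blocks first.
    A rough sketch of the idea in the Req-block abstract, not the paper's design."""

    def __init__(self, capacity_pages, num_levels=4):
        self.capacity = capacity_pages
        self.used = 0
        # One LRU list per size class: level 0 holds the smallest requests.
        self.levels = [OrderedDict() for _ in range(num_levels)]
        self.num_levels = num_levels

    def _level_of(self, size_pages):
        # Hypothetical size classes: 1 page -> 0, 2-3 -> 1, 4-7 -> 2, rest -> 3.
        return min(max(size_pages.bit_length() - 1, 0), self.num_levels - 1)

    def access(self, request_id, size_pages):
        if size_pages > self.capacity:
            return False                            # too large to cache: bypass
        lvl = self._level_of(size_pages)
        if request_id in self.levels[lvl]:          # hit: refresh recency
            self.levels[lvl].move_to_end(request_id)
            return True
        while self.used + size_pages > self.capacity and self.used > 0:
            self._evict()
        self.levels[lvl][request_id] = size_pages   # miss: insert whole request
        self.used += size_pages
        return False

    def _evict(self):
        # Prefer evicting the coldest block from the largest-size class, since
        # (per the abstract) small requests are re-accessed more often.
        for lvl in reversed(range(self.num_levels)):
            if self.levels[lvl]:
                _, size = self.levels[lvl].popitem(last=False)
                self.used -= size
                return

cache = RequestBlockCache(capacity_pages=8)
print(cache.access("w1", 1), cache.access("w2", 6), cache.access("w1", 1))
```

Real SSD firmware would additionally track page mappings and dirty flush queues, but the eviction preference is the same: large request blocks, which occupy more DRAM and are re-accessed less often, go first.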
SPAMeR: Speculative Push for Anticipated Message Requests in Multi-Core Systems
Proceedings of the 51st International Conference on Parallel Processing · Pub Date: 2022-08-29 · DOI: 10.1145/3545008.3545044
Qinzhe Wu, Ashen Ekanayake, Ruihao Li, J. Beard, L. John
Abstract: With increasing core counts and multiple levels of cache memories, scaling multi-threaded and task-level parallel workloads is continuously becoming a challenge. A key challenge to scaling the number of communicating tasks (or threads) is the rate at which existing communication mechanisms scale, in terms of latency and bandwidth. Architectures with hardware-accelerated queuing operations have the potential to reduce the latency and improve the scalability of moving data between processing elements, reducing synchronization penalties and thereby improving the performance of task-level parallel workloads. While hardware queues reduce synchronization penalties, they cannot fully hide load-to-use latency; i.e., perfect pipelines often are not realized. There is the potential, however, for better overlap: if the inter-processor communication latency is at most the time spent processing a message at the consumer, all of the latency may be overlapped while the consumer is processing. We exploit this property to speed up parallel applications above and beyond existing hardware queues. In this paper, we present SPAMeR, a speculation mechanism built on top of a state-of-the-art hardware-driven message queue architecture. SPAMeR can speculatively push messages in anticipation of consumer message requests. Unlike prefetch approaches, which predict what addresses to fetch next, with a queue we know exactly what data is needed next but not when it is needed; SPAMeR adds algorithms that attempt to predict this. We evaluate the effectiveness of SPAMeR with a set of diverse task-parallel benchmarks on the gem5 full-system simulator and observe a 1.33× average speedup.
Citations: 0
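As the abstract notes, a queue knows exactly what message comes next but not when the consumer will ask for it. Below is a minimal stand-in for such a timing predictor, assuming a simple exponential moving average of past inter-request intervals; the real SPAMeR prediction algorithms are not described here, and all names and timings are hypothetical.

```python
class SpeculativePusher:
    """Toy timing predictor: push the next queued message early enough that it
    arrives as the consumer finishes its current work. Illustrative only."""

    def __init__(self, link_latency, alpha=0.5):
        self.link_latency = link_latency   # producer -> consumer transfer time
        self.alpha = alpha                 # EMA smoothing factor
        self.avg_interval = None           # predicted gap between requests
        self.last_request_time = None

    def observe_request(self, now):
        # Update the moving average of inter-request intervals.
        if self.last_request_time is not None:
            interval = now - self.last_request_time
            if self.avg_interval is None:
                self.avg_interval = interval
            else:
                self.avg_interval = (self.alpha * interval
                                     + (1 - self.alpha) * self.avg_interval)
        self.last_request_time = now

    def next_push_time(self):
        # Push early so the message lands right when the next request is due.
        if self.avg_interval is None:
            return None  # no history yet: fall back to pull-on-request
        return self.last_request_time + self.avg_interval - self.link_latency

pusher = SpeculativePusher(link_latency=2.0)
for t in [0.0, 10.0, 21.0, 30.0]:
    pusher.observe_request(t)
print(pusher.next_push_time())  # speculative push time for the next message
```

If the prediction is late, the consumer simply falls back to an ordinary pull; if it is early, the message waits at the consumer, which is exactly the overlap the paper exploits.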
ADSTS: Automatic Distributed Storage Tuning System Using Deep Reinforcement Learning
Proceedings of the 51st International Conference on Parallel Processing · Pub Date: 2022-08-29 · DOI: 10.1145/3545008.3545012
Kai Lu, Guokuan Li, Ji-guang Wan, Ruixiang Ma, Wei Zhao
Abstract: Modern distributed storage systems, with their immense number of configurations, unpredictable workloads, and difficult performance evaluation, place high demands on parameter tuning, so an automatic tuning solution is in demand. Many studies have attempted to build automatic tuning systems based on deep reinforcement learning (RL). However, they have several limitations in the face of these requirements, including a lack of parameter-space preprocessing, less advanced RL models, and time-consuming, unstable training. In this paper, we present and evaluate ADSTS, an automatic distributed storage tuning system based on deep reinforcement learning. A general preprocessing guideline is first proposed to generate a standardized tunable-parameter domain: Recursive Stratified Sampling, which avoids the non-incremental nature of conventional sampling, is designed to sample huge parameter spaces, and Lasso regression is adopted to identify important parameters. Besides, the twin-delayed deep deterministic policy gradient (TD3) method is utilized to find the optimal values of the tunable parameters. Finally, multi-processing training and workload-directed model fine-tuning are adopted to accelerate model convergence. ADSTS is implemented on Park and is used in the real-world system Ceph. The evaluation results show that ADSTS can recommend near-optimal configurations and improve system performance by 1.5×-2.5× with acceptable overheads.
Citations: 0
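Of the preprocessing steps the abstract names, the Lasso-based importance screening is the most self-contained. Here is a generic sketch with scikit-learn, using synthetic measurements in place of real benchmark runs: parameters whose coefficients survive the L1 penalty are the ones kept for the RL tuning stage.

```python
import numpy as np
from sklearn.linear_model import Lasso
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in: 200 sampled configurations of 10 tunables, where only
# parameters 0 and 3 actually influence the measured throughput.
X = rng.uniform(0, 1, size=(200, 10))
y = 5.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(0, 0.1, size=200)

X_std = StandardScaler().fit_transform(X)
lasso = Lasso(alpha=0.05).fit(X_std, y)

# Parameters with nonzero coefficients are kept for the RL tuning stage.
important = np.flatnonzero(np.abs(lasso.coef_) > 1e-6)
print("important parameters:", important)   # expected: [0 3]
```

Shrinking the tunable set this way is what makes the subsequent TD3 search tractable over a storage system's otherwise immense configuration space.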
GraphSD: A State and Dependency aware Out-of-Core Graph Processing System
Proceedings of the 51st International Conference on Parallel Processing · Pub Date: 2022-08-29 · DOI: 10.1145/3545008.3545039
Xianghao Xu, Hong Jiang, Fang Wang, Yongli Cheng, Peng Fang
Abstract: In recent years, system researchers have proposed many out-of-core graph processing systems to efficiently handle graphs that exceed the memory capacity of a single machine. Through disk-friendly graph data organizations and well-designed execution engines, existing out-of-core graph processing systems can maintain sequential locality of disk access and greatly reduce disk I/Os during processing. However, they have not fully explored the characteristics of graph data and algorithm execution to further reduce disk I/Os, leaving significant room for performance improvement. In this paper, we present a novel out-of-core graph processing system called GraphSD, which optimizes I/O traffic by simultaneously capturing the state and dependency of graph data during computation. At the heart of GraphSD is a state- and dependency-aware update strategy with two adaptive update models: selective cross-iteration update (SCIU) and full cross-iteration update (FCIU). These two update models are dynamically triggered at runtime to enable active-vertex-aware processing and cross-iteration vertex value computation, which avoid loading inactive edges and reduce disk I/Os in future iterations. Moreover, an efficient sub-block-based buffering scheme is proposed to further minimize I/O overheads. Our evaluation results show that GraphSD outperforms two state-of-the-art out-of-core graph processing systems, HUS-Graph and Lumos, by up to 2.7× and 3.9×, respectively.
Citations: 2
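The I/O saving behind active-vertex-aware processing is skipping edge blocks that contain no active vertices. The toy out-of-core BFS below mimics this in plain Python: adjacency data lives in per-block "disk" storage, and each iteration loads only blocks whose vertex range intersects the frontier. The block layout and sizes are invented, and GraphSD's actual SCIU/FCIU models are considerably more involved.

```python
# Toy model: vertices 0..15 split into 4 edge blocks of 4 source vertices each.
BLOCK = 4
disk_blocks = {          # block id -> adjacency lists for its source vertices
    0: {0: [1, 5], 1: [2], 2: [3], 3: []},
    1: {4: [8], 5: [6], 6: [], 7: []},
    2: {8: [9], 9: [], 10: [], 11: []},
    3: {12: [], 13: [], 14: [], 15: []},
}

def bfs_active_blocks(source):
    visited, frontier, loads = {source}, {source}, 0
    while frontier:
        # Only load blocks that actually contain an active (frontier) vertex.
        active_blocks = {v // BLOCK for v in frontier}
        loads += len(active_blocks)
        nxt = set()
        for b in active_blocks:
            adj = disk_blocks[b]                 # simulated disk read
            for v in frontier & set(adj):
                nxt.update(u for u in adj[v] if u not in visited)
        visited |= nxt
        frontier = nxt
    return visited, loads

visited, loads = bfs_active_blocks(0)
print(sorted(visited), "block loads:", loads)   # 6 loads vs. 16 for a full scan
```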
Online Resource Optimization for Elastic Stream Processing with Regret Guarantee
Proceedings of the 51st International Conference on Parallel Processing · Pub Date: 2022-08-29 · DOI: 10.1145/3545008.3545063
Yang Liu, Huanle Xu, W. Lau
Abstract: Recognizing the explosion of large-scale real-time analytics needs, a plethora of stream processing systems, such as Apache Storm and Flink, have been developed to support such applications. Under these systems, a stream processing application is realized as a directed acyclic graph (DAG) of operators, where the resource configuration of each operator has a significant impact on overall throughput and latency. However, there is a lack of dynamic resource allocation schemes that are theoretically sound and practically implementable, especially under drastically changing offered load. To address this challenge, we present Dragster, an online-optimization-based dynamic resource allocation scheme for elastic stream processing. By combining the online optimization framework with upper confidence bound (UCB) techniques, Dragster guarantees, in expectation, a sub-linear increase in throughput regret with respect to time. To demonstrate its efficacy, we implement Dragster to improve the throughput of Flink applications over Kubernetes. Compared to the state-of-the-art algorithm Dhalion, Dragster achieves a 1.8×-2.2× speed-up in converging to the optimal configuration, contributing a 20.0%-25.8% gain in tuple-processing goodput and 14.6%-15.6% cost savings.
Citations: 3
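The UCB ingredient the abstract mentions can be shown in isolation: a UCB1 bandit choosing among a few candidate parallelism levels by observed throughput, balancing exploration against exploitation. This sketch omits Dragster's online-optimization framework and regret analysis, and the configuration set and throughput model are made up.

```python
import math, random

random.seed(1)
configs = [1, 2, 4, 8]          # hypothetical parallelism levels for an operator

def measure_throughput(parallelism):
    # Stand-in for a noisy measurement; a real system would observe the job.
    base = {1: 0.30, 2: 0.55, 4: 0.80, 8: 0.70}[parallelism]
    return max(0.0, random.gauss(base, 0.05))

counts = [0] * len(configs)
totals = [0.0] * len(configs)

for t in range(1, 201):
    if 0 in counts:                         # play each arm once first
        arm = counts.index(0)
    else:                                   # UCB1: mean + exploration bonus
        arm = max(range(len(configs)),
                  key=lambda i: totals[i] / counts[i]
                               + math.sqrt(2 * math.log(t) / counts[i]))
    reward = measure_throughput(configs[arm])
    counts[arm] += 1
    totals[arm] += reward

best = max(range(len(configs)), key=lambda i: totals[i] / counts[i])
print("chosen parallelism:", configs[best], "pulls per arm:", counts)
```

The shrinking exploration bonus is what yields the sub-linear regret growth that bandit analyses of this kind provide.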
Accelerating Parallel First-Principles Excited-State Calculation by Low-Rank Approximation with K-Means Clustering
Proceedings of the 51st International Conference on Parallel Processing · Pub Date: 2022-08-29 · DOI: 10.1145/3545008.3545092
Qingcai Jiang, Jielan Li, Junshi Chen, Xinming Qin, Lingyun Wan, Jinlong Yang, Jie Liu, Wei Hu, Hong An
Abstract: First-principles time-dependent density functional theory (TDDFT) is a powerful tool to accurately describe the excited-state properties of molecules and solids in condensed matter physics, computational chemistry, and materials science. However, a perceived drawback of TDDFT calculations is their ultrahigh computational cost and large memory usage, especially with plane-wave basis sets, hindering their application to large systems containing thousands of atoms. Here, we present a massively parallel implementation of linear-response TDDFT (LR-TDDFT) and reduce its complexity by combining K-Means clustering based low-rank approximation with an iterative eigensolver. Furthermore, we carefully design the parallel data and task distribution schemes to accommodate the physical nature of different steps of the computation, and several optimization methods are employed to efficiently handle the matrix operations and data communication involved in constructing and diagonalizing the LR-TDDFT Hamiltonian. In particular, our method reduces the cost of computation and memory by nearly two orders of magnitude compared to conventional LR-TDDFT calculations. Numerical results demonstrate that our implementation gains an overall speedup of 10× and efficiently scales up to 12,288 CPU cores for systems of up to 4,096 atoms within dozens of seconds.
Citations: 0
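The abstract's pairing of K-Means with low-rank approximation follows a common pattern: cluster the rows of a tall matrix, keep one representative row per cluster, and reconstruct the remaining rows by least squares. The NumPy/scikit-learn sketch below applies that pattern to a synthetic low-rank matrix; it is only a schematic analogue of the interpolation-point selection used in LR-TDDFT codes, not the paper's algorithm.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# A tall matrix with rank-8 structure, as a stand-in for the dense
# intermediate quantities that make LR-TDDFT expensive.
n, m, k = 2000, 200, 8
A = rng.normal(size=(n, k)) @ rng.normal(size=(k, m))

# K-Means over rows; take the row nearest each centroid as a representative.
km = KMeans(n_clusters=k, n_init=4, random_state=0).fit(A)
reps = np.array([
    np.argmin(np.linalg.norm(A - c, axis=1)) for c in km.cluster_centers_
])

A_rep = A[reps]                                    # k x m representative rows
W, *_ = np.linalg.lstsq(A_rep.T, A.T, rcond=None)  # A ~= W.T @ A_rep
A_lr = W.T @ A_rep

rel_err = np.linalg.norm(A - A_lr) / np.linalg.norm(A)
print(f"rank-{k} reconstruction from {k} K-Means rows, rel. error {rel_err:.2e}")
```

Because only k representative rows are ever formed and stored, this style of compression is where the near two-orders-of-magnitude cost and memory reduction comes from.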
TileSpMSpV: A Tiled Algorithm for Sparse Matrix-Sparse Vector Multiplication on GPUs
Proceedings of the 51st International Conference on Parallel Processing · Pub Date: 2022-08-29 · DOI: 10.1145/3545008.3545028
H. Ji, Huimin Song, Shibo Lu, Zhou Jin, Guangming Tan, Weifeng Liu
Abstract: Sparse matrix-sparse vector multiplication (SpMSpV) is an important primitive for graph algorithms and machine learning applications. The sparsity of the input and output vectors makes its floating-point efficiency generally lower than that of sparse matrix-vector multiplication (SpMV) and sparse matrix-matrix multiplication (SpGEMM). Existing parallel SpMSpV methods have focused on various row- and column-wise storage formats and merging operations, but the data locality and sparsity patterns of the input matrix and vector are largely ignored. In this paper we propose TileSpMSpV, a tiled algorithm for accelerating SpMSpV on GPUs. First, tile-wise storage structures are developed for quickly positioning groups of nonzeros in the matrix and vectors. Then, we develop the TileSpMSpV algorithm on top of these storage structures. In addition, to accelerate direction-optimizing breadth-first search (BFS) by using TileSpMSpV, we propose a TileBFS algorithm including three kernels called Push-CSC, Push-CSR, and Pull-CSC. In experiments running on a high-end NVIDIA GPU over 2,757 sparse matrices, the TileSpMSpV algorithm outperforms TileSpMV, cuSPARSE, and CombBLAS by factors of 1.83, 17.18, and 17.20 on average (up to 7.68, 1050.02, and 235.90), respectively. Moreover, our TileBFS algorithm outperforms Gunrock and GSwitch by factors of 2.88 and 4.52 on average (up to 21.35 and 1000.85), respectively.
Citations: 2
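The payoff of tile-wise storage is that whole tiles can be skipped when their column range contains no nonzero of the input vector. Below is a small dictionary-based sketch of that skip logic in plain Python; the 2x2 tile size and storage layout are illustrative only, and the paper's GPU structures are far more compact.

```python
from collections import defaultdict

TILE = 2   # tiny tile size for illustration

def tile_matrix(entries):
    """entries: iterable of (row, col, val) -> {(trow, tcol): [(r, c, v), ...]}"""
    tiles = defaultdict(list)
    for r, c, v in entries:
        tiles[(r // TILE, c // TILE)].append((r, c, v))
    return tiles

def tiled_spmspv(tiles, x):
    """x: {col: val} sparse vector. Only tiles overlapping x's tiles are read."""
    x_tiles = {c // TILE for c in x}
    y = defaultdict(float)
    for (trow, tcol), nnz in tiles.items():
        if tcol not in x_tiles:          # skip tiles with no matching vector tile
            continue
        for r, c, v in nnz:
            if c in x:
                y[r] += v * x[c]
    return dict(y)

A = tile_matrix([(0, 0, 1.0), (0, 3, 2.0), (2, 1, 3.0), (3, 3, 4.0)])
x = {3: 10.0}                            # sparse vector with one nonzero
print(tiled_spmspv(A, x))                # {0: 20.0, 3: 40.0}
```

On a GPU the same tile test is what lets a thread block position a whole group of nonzeros, or discard it, in one step.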
From RTL to CUDA: A GPU Acceleration Flow for RTL Simulation with Batch Stimulus
Proceedings of the 51st International Conference on Parallel Processing · Pub Date: 2022-08-29 · DOI: 10.1145/3545008.3545091
Dian-Lun Lin, Haoxing Ren, Yanqing Zhang, Brucek Khailany, Tsung-Wei Huang
Abstract: High-throughput RTL simulation is critical for verifying today's highly complex SoCs. Recent research has explored accelerating RTL simulation by leveraging event-driven approaches or partitioning heuristics to speed up simulation of a single stimulus. To further accelerate throughput, industry-quality functional verification signoff must explore running multiple stimuli (i.e., batch stimulus) simultaneously, either with directed tests or random inputs. In this paper, we propose RTLFlow, a GPU-accelerated RTL simulation flow with batch stimulus. RTLFlow first transpiles RTL into CUDA kernels, each of which simulates a partition of the RTL simultaneously across multiple stimuli. It also leverages CUDA Graphs and pipeline scheduling for efficient runtime execution. Measured on a large industrial design (NVDLA) with 65,536 stimuli, RTLFlow running on a single A6000 GPU achieves a 40× runtime speed-up compared to an 80-thread multi-core CPU baseline.
Citations: 10
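RTLFlow's batch-stimulus parallelism evaluates the same RTL partition across many stimuli at once. As a CPU-side analogue, the NumPy sketch below evaluates an invented two-gate combinational partition over a whole stimulus batch in a single sweep, with array lanes standing in for the per-stimulus CUDA threads of the transpiled kernels; the netlist is made up for illustration.

```python
import numpy as np

def eval_partition_batch(a, b, sel):
    """Evaluate a tiny combinational partition for a whole stimulus batch.
    Each array index is one stimulus -- the NumPy analogue of one CUDA
    thread per stimulus in a transpiled kernel."""
    and_out = a & b                                       # gate 1: AND
    mux_out = np.where(sel.astype(bool), and_out, a ^ b)  # gate 2: MUX(AND, XOR)
    return mux_out

batch = 65536                        # batch-stimulus width, as in the paper's runs
rng = np.random.default_rng(0)
a = rng.integers(0, 2, batch, dtype=np.uint8)
b = rng.integers(0, 2, batch, dtype=np.uint8)
sel = rng.integers(0, 2, batch, dtype=np.uint8)

out = eval_partition_batch(a, b, sel)
print(out[:8], "...", out.shape)     # 65536 stimuli simulated in one sweep
```

Because every stimulus executes identical gate logic, the work is embarrassingly parallel across the batch dimension, which is exactly what a GPU exploits.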
Repair-Optimal Data Placement for Locally Repairable Codes with Optimal Minimum Hamming Distance
Proceedings of the 51st International Conference on Parallel Processing · Pub Date: 2022-08-29 · DOI: 10.1145/3545008.3545038
Shuang Ma, Si Wu, Cheng Li, Yinlong Xu
Abstract: Modern clustered storage systems increasingly adopt erasure coding to realize reliable data storage at low storage redundancy. Locally Repairable Codes (LRC) are a family of practical erasure codes with high repair efficiency. Among various LRC constructions, Optimal-LRC is a recently proposed approach that achieves the optimal minimum Hamming distance with low theoretical repair cost. In this paper, we consider the repair performance of Optimal-LRC in clustered storage systems. We show that the conventional flat data placement and random data placement incur substantial cross-cluster repair traffic, which impairs repair performance. To this end, we design an optimal data placement scheme that provably minimizes cross-cluster repair traffic by carefully placing each group of blocks of Optimal-LRC into a minimum number of clusters, subject to single-cluster fault tolerance. We implement our optimal data placement scheme on a key-value store prototype atop Memcached and show via LAN testbed experiments that it significantly improves repair performance compared to the conventional data placements.
Citations: 1
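A toy version of the placement rule: assuming single-cluster fault tolerance allows at most f blocks of a local group in any one cluster, a group of g blocks needs ceil(g / f) clusters. The routine below round-robins a group across that minimum cluster set; the bound f and the block names are hypothetical, and the paper derives the real constraint from the code's fault-tolerance parameters.

```python
import math
from itertools import cycle

def place_group(group_blocks, max_per_cluster, cluster_ids):
    """Pack one local group into the minimum number of clusters, putting at
    most max_per_cluster of its blocks in any single cluster (so losing one
    cluster stays repairable). Toy sketch of the placement rule."""
    needed = math.ceil(len(group_blocks) / max_per_cluster)
    assert needed <= len(cluster_ids), "not enough clusters for this group"
    chosen = cluster_ids[:needed]
    placement = {cid: [] for cid in chosen}
    for block, cid in zip(group_blocks, cycle(chosen)):
        placement[cid].append(block)
    return placement

# A local group of 5 blocks (4 data + 1 local parity); assume single-cluster
# fault tolerance allows at most 2 of them per cluster.
group = ["d0", "d1", "d2", "d3", "p_local"]
print(place_group(group, max_per_cluster=2, cluster_ids=["c0", "c1", "c2"]))
```

Using as few clusters as the constraint allows is what keeps single-block repairs, which read only within the group, from crossing cluster boundaries.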
HSP: Hybrid Synchronous Parallelism for Fast Distributed Deep Learning
Proceedings of the 51st International Conference on Parallel Processing · Pub Date: 2022-08-29 · DOI: 10.1145/3545008.3545024
Yijun Li, Jiawei Huang, Zhaoyi Li, Shengwen Zhou, Wanchun Jiang, Jianxin Wang
Abstract: In parameter-server-based distributed deep learning systems, the workers communicate with the parameter server simultaneously to refine model parameters, easily resulting in severe network contention. To solve this problem, the Asynchronous Parallel (ASP) strategy enables each worker to update the parameters independently without synchronization. However, due to the inconsistency of parameters among workers, ASP suffers accuracy loss and slow convergence. In this paper, we propose Hybrid Synchronous Parallelism (HSP), which mitigates communication contention without excessive degradation of convergence speed. Specifically, the parameter server sequentially pulls gradients from workers to eliminate network congestion and synchronizes all up-to-date parameters after each iteration. Meanwhile, HSP cautiously lets idle workers compute with out-of-date weights to maximize the utilization of computing resources. We provide a theoretical analysis of convergence efficiency and implement HSP on a popular deep learning (DL) framework. The test results show that HSP improves the convergence speed of three classical deep learning models by up to 67%.
Citations: 1
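Here is a small discrete-time sketch of the schedule described in the abstract: the server pulls gradients one at a time in completion order (so pulls never contend), synchronizes after the last pull, and any worker that finished well before the sync is flagged to restart on stale weights rather than idle. All timings are invented and the staleness rule is a simplification of HSP's actual policy.

```python
def hsp_iteration(finish_times, pull_time):
    """One HSP-style iteration: sequential gradient pulls in completion order,
    a global sync at the end, and early finishers marked to continue on
    out-of-date weights instead of idling. Purely illustrative."""
    order = sorted(range(len(finish_times)), key=lambda w: finish_times[w])
    t, pulls, stale_restarts = 0.0, [], []
    for w in order:
        t = max(t, finish_times[w])   # wait until worker w's gradient is ready
        t += pull_time                # sequential pull: no network contention
        pulls.append((w, t))
    sync_time = t                     # all parameters synchronized here
    for w, done in enumerate(finish_times):
        if done + pull_time < sync_time:   # would idle: compute on stale weights
            stale_restarts.append(w)
    return pulls, sync_time, stale_restarts

pulls, sync, stale = hsp_iteration([3.0, 5.0, 4.0], pull_time=0.5)
print("pull completions:", pulls)
print(f"sync at t={sync}; workers restarted on stale weights: {stale}")
```

The hybrid trade-off is visible even in this toy: pulls are serialized like BSP to avoid contention, while the early finishers borrow ASP's staleness to keep compute busy.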