{"title":"Mlog: Multi-log Write Buffer upon Ultra-fast SSD RAID","authors":"Shucheng Wang, Q. Cao, Ziyi Lu, Jie Yao","doi":"10.1145/3545008.3545034","DOIUrl":"https://doi.org/10.1145/3545008.3545034","url":null,"abstract":"Parity-based RAID suffering from partial-stripe write-penalty has to introduce write buffer to fast absorb and merge incoming writes, and then flush them to RAID array in batch. However, we experimentally observe that the popular buffering mechanism as Linux RAID journal and partial parity logging (PPL) becomes a bottleneck for ultra-fast SSD-based RAID, and we further uncover that the centralized log-buffer model is the prime cause. In this paper, we propose a highly-parallel multi-Log RAID write buffer, Mlog, employing a two-dimensional log data-layout and a parallel I/O processing model to fully exploit both intra-/inter-SSDs parallelism. Specifically, Mlog partitions the global log-buffer into a set of SSD-zones located within each SSDs in horizontal, and the SSD-zone further divided into sublogs in vertical. Each sublog adopts a fine-grained submission lock. A set of sublogs with the same logical offset across different SSDs are combined into a LogGroup. Moreover, Mlog presents a two-phase write allocation to hash an incoming request to a LogGroup, and then strategically writes it to a dedicated sublog, thus providing highly-parallel logging writes. Mlog schedules part of LogGroups to serve incoming requests while reclaiming the others in the background, unleashing the internal-parallelism of SSDs. Finally, Mlog generates parities for the buffered user data within a LogGroup, enhancing data reliability. We evaluate Mlog with a variety of benchmarks and real-world traces. Mlog is shown to consistently outperform Linux RAID journal and PPL by up to 122 × in the write throughput under intensive workloads.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127459646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi Resource Scheduling with Task Cloning in Heterogeneous Clusters","authors":"Huanle Xu, Yang Liu, W. Lau","doi":"10.1145/3545008.3545093","DOIUrl":"https://doi.org/10.1145/3545008.3545093","url":null,"abstract":"To mitigate the straggler effect, today’s systems and computing frameworks have adopted redundancy to launch extra copies for stragglers. Two limitations of the existing straggler-mitigation techniques, however, are that resource demand of tasks is only considered in the context of slots and, moreover, redundancy is seldom coordinated with job scheduling. To tackle these issues, in this paper, we present DollyMP, a job scheduler that addresses multi-resource scheduling with task cloning in heterogeneous clusters. DollyMP carefully combines SRPT (Shortest Remaining Processing Time) and SVF (Smallest Volume First) via knapsack optimization to schedule tasks with multi-resource demands and, in the meanwhile, dynamically launches task clones to yield a small job completion time. DollyMP is built on a strong mathematical foundation to guarantee near-optimal performance. The deployment of our Hadoop YARN prototype on a 30-node cluster demonstrates that DollyMP can reduce job response time by 50% under different cluster loads.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130446186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Postmortem Computation of Pagerank on Temporal Graphs","authors":"M. Hossain, Erik Saule","doi":"10.1145/3545008.3545055","DOIUrl":"https://doi.org/10.1145/3545008.3545055","url":null,"abstract":"Temporal graphs capture changes in relational data over time and have been of increasing interest to data analysts. Most research focuses on streaming algorithms that incrementally update an analysis to account for the changes in the graph. However, one can also be interested in understanding the nature of changes in the graph over time. In such a case, they perform a postmortem analysis on different points in time where all the data known in advance We study in this paper a postmortem analysis of Pagerank over-time on graphs that are defined by temporal relational event databases. A relation between two entities at a particular point in time will form an edge between these two entities and that will remain in the graph for a fixed period of time. While one can reuse a streaming algorithm for that purpose, leveraging the availability of all the data from the beginning can be beneficial. Postmortem analysis enables encoding the temporal graph with a more efficient graph representation. Also, it provides an additional level of parallelism since one can not only parallelize within a particular timestamp but also across different timestamps. We will show that depending on the properties of the temporal data, either parallelization can be better, and in some cases, a combination of both approaches is preferable. We experimentally show across 7 databases and across different temporal derivations of the graph that postmortem analysis can be between 50 times and 880 times faster than streaming analysis.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129455021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway Processor","authors":"Xiaohan Tao, Yuming Zhu, Bo-Hsuan Wang, Jinlong Xu, Jianmin Pang, Jie Zhao","doi":"10.1145/3545008.3545031","DOIUrl":"https://doi.org/10.1145/3545008.3545031","url":null,"abstract":"We present an approach to the automatic generation of efficient matrix multiplication code on the latest Sunway processor, which will be employed by the next-generation machine of Sunway TaihuLight, one of the fastest supercomputers on earth. The method allows users to write simple C code and automatically generates high-performance matrix multiplication kernels. It uses polyhedral transformations to implement rapid compute decomposition, data exchanges across memory hierarchy and memory latency hiding. An assembly routine is finally integrated into the generated kernels. While achieving up to 90.14% of the theoretical peak performance, our method surpasses a highly tuned library by 9.44%. Compared with existing techniques, our approach reduces the software development life cycle to generate efficient matrix code from months to seconds. We also take into account batched matrix multiplication and some fusion patterns for deep learning (DL), outperforming the library-based implementations by 1.30 × and 1.67 ×.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117162745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-Online Multi-Machine with Restart Scheduling for Integrated Edge and Cloud Computing Systems","authors":"Liming Ge, Zizhao Wang, Wei Bao, Dong Yuan, N. H. Tran, B. Zhou, Albert Y. Zomaya","doi":"10.1145/3545008.3545059","DOIUrl":"https://doi.org/10.1145/3545008.3545059","url":null,"abstract":"We study the multi-machine task scheduling problem in an integrated serverless edge and cloud computing system, where tasks can be scheduled locally on edge processors or offloaded to cloud servers, with the objective of minimizing the makespan, i.e., the total time to finish all tasks. The system is semi-online, where the edge processing delays of the tasks are known as priori, but the cloud processing delays remain unknown due to the uncertainty introduced by uploading and loading delay (loading the software environment). The problem is NP-hard in nature, and therefore we resort to approximation schemes and propose a novel algorithm named multi-machine with restart scheduling (MRS). MRS utilizes task restart, where a task that is cancelled will be restarted later when its processing time exceeds the threshold, and the threshold can be adaptively adjusted. We derive an competitive ratio for MRS so that its worst-case gap from the optimal solution is bounded. We also implement the MRS scheduler in a real-world system, which schedules a diverse set of Deep Neural Network (DNN) inference tasks. It shows that MRS achieves significant reduction in makespan compared to existing benchmark schemes.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124563428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tensor-Accelerated Fourth-Order Epistasis Detection on GPUs","authors":"Ricardo Nobre, A. Ilic, Sergio Santander-Jiménez, Leonel Sousa","doi":"10.1145/3545008.3545066","DOIUrl":"https://doi.org/10.1145/3545008.3545066","url":null,"abstract":"The improved accessibility of gene sequencing technologies has led to creation of huge datasets, i.e. patient records related to certain human diseases (phenotypes). Hence, deriving fast and accurate algorithms for efficiently processing these datasets is a paramount concern to enable some key healthcare scenarios, such as personalizing treatments, explaining the occurrence of and/or susceptibility to complex conditions and reducing the spread of infectious diseases. This is especially true for high-order epistasis detection, one of the most computationally challenging problems in bioinformatics, where associations between a given phenotype and single nucleotide polymorphisms (SNPs) of a population can often only be uncovered through evaluation of a large number of SNP combinations. To tackle this challenge, we propose a novel fourth-order epistasis detection algorithm that leverages tensor processing capabilities of two distinct accelerator architectures by efficiently mapping core computations related to processing quads of SNPs to binary tensor-accelerated matrix operations. Experimental results show that the proposed approach delivers very high performance even in single-GPU environments, e.g., 27.8 and 90.9 tera quads of SNPs per second, scaled to the sample size, were processed on Titan RTX (Turing) and A100 (Ampere) PCIe GPUs, respectively. Being the first approach that exploits tensor cores for accelerating searches with interaction order above three, the proposed method achieved a performance of up to 835.4 tera quads of SNPs per second on the 8-GPU HGX A100 server, which represents performance two or more orders of magnitude higher than that of related art.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116061943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Phase-Functioned Real-time Character Control in Mobile Games: A TVM Enabled Approach","authors":"Haidong Lan, Wenxi Zhu, Du Wu, Qian Qiu, Honglin Zhu, Jingjing Zhao, Xinghui Fu, Liu Wei, Jintao Meng, Minwen Deng","doi":"10.1145/3545008.3545095","DOIUrl":"https://doi.org/10.1145/3545008.3545095","url":null,"abstract":"In this paper, we propose a highly efficient computing method for game character control with phase-functioned neural networks (PFNN). The primary challenge to accelerate PFNN on mobile platforms is that PFNN dynamically produces weight matrices with an argument, phase, which is individual to each game character. Therefore existing libraries that generally assume frozen weight matrices are inefficient to accelerate PFNN. The situation becomes even worse when multiple characters are present. To address the challenges, we reformulate the equations and leverage the deep learning compiler stack TVM to build a cross-platform, high-performance implementation. Evaluations reveal that our solutions deliver close-to-peak performance on various platforms, from high-performance servers to energy-efficient mobile platforms. This work is publicly available at https://github.com/turbo0628/pfnn_tvm.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129838996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spread: Decentralized Model Aggregation for Scalable Federated Learning","authors":"Chuang Hu, Huang Huang Liang, Xiao Han, Bo Liu, D. Cheng, Dan Wang","doi":"10.1145/3545008.3545030","DOIUrl":"https://doi.org/10.1145/3545008.3545030","url":null,"abstract":"Federated learning (FL) is a new distributed machine learning paradigm that enables machine learning on edge devices. One unique feature of FL is that edge devices belong to individuals; and since they are not “owned” by the FL coordinator, but can be “federated” instead, there can potentially be a huge number of edge devices. In the current distributed ML architecture, the parameter server (PS) architecture, model aggregation is centralized. When facing a large number of edge devices, the centralized model aggregation becomes the bottleneck and fundamentally restricts system scalability. In this paper, we present Spread to decentralize model aggregation. Spread is a tiered architecture where nodes are organized into clusters so that model aggregation can be offloaded to certain edge devices. We design a Spread-based FL system: it employs a new algorithm for cluster construction and an adaptive algorithm that regulates, in runtime, inter-cluster model training and intra-cluster model training. We present an implementation of a functional system by extending the Federated Learning system. Our evaluation shows that Spread can resolve the bottleneck of centralized model aggregation. Spread yields an 8.05 × and a 5.58 × model training speedup as compared to existing FL systems supported by the PS and allReduce architecture.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116153046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Scheduling of Moldable Task Graphs under Common Speedup Models","authors":"A. Benoit, Lucas Perotin, Y. Robert, Hongyang Sun","doi":"10.1145/3545008.3545049","DOIUrl":"https://doi.org/10.1145/3545008.3545049","url":null,"abstract":"The problem of scheduling moldable tasks on multiprocessor systems with the objective of minimizing the overall completion time (or makespan) has been widely studied, in particular when tasks have dependencies (i.e., task graphs), or when tasks are released on-the-fly (i.e., online). However, few studies have focused on both (i.e., online scheduling of moldable task graphs). In this paper, we design a new online algorithm and derive constant competitive ratios for this problem under several common yet realistic speedup models (i.e., roofline, communication, Amdahl, and a general combination). We also prove, for each model, a lower bound on the competitiveness of our algorithm, which is very close to the constant competitive ratio. Finally, we provide the first lower bound on the competitive ratio of any deterministic online algorithm for the arbitrary speedup model, which is not constant but depends on the number of tasks in the longest path of the graph.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116446602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Highly Parallel Linear Forest Extraction from a Weighted Graph on GPUs","authors":"Christopher J. Klein, R. Strzodka","doi":"10.1145/3545008.3545035","DOIUrl":"https://doi.org/10.1145/3545008.3545035","url":null,"abstract":"For graph matching, each vertex is allowed to match with exactly one other vertex, such that the spanning subgraph of the matching has a maximum degree of one, i.e., the subgraph is a [0,1]-factor. In this work, we provide a highly parallel algorithm to extract a spanning subgraph with a maximum degree of n (the subgraph is a [0,n]-factor) and demonstrate the efficiency of our GPU implementation for n=1,2,3,4 by expressing the algorithm in terms of generalized sparse matrix-vector products. Moreover, from the [0,2]-factor, we compute a maximum linear forest (union of disjoint paths) by breaking up cycles and permuting the subgraph with respect to the vertex order within the paths. Both tasks execute efficiently on the GPU because of our novel parallel scan implementation, which does not require a random access iterator. As an application of linear forests, we demonstrate the algebraic creation of enhanced tridiagonal preconditioners for various large matrices from the Sparse Matrix Collection and report runtimes in the order of milliseconds for graphs with millions of edges and vertices on an RTX 2080 Ti.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123904826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}