{"title":"Mlog: Multi-log Write Buffer upon Ultra-fast SSD RAID","authors":"Shucheng Wang, Q. Cao, Ziyi Lu, Jie Yao","doi":"10.1145/3545008.3545034","DOIUrl":"https://doi.org/10.1145/3545008.3545034","url":null,"abstract":"Parity-based RAID suffering from partial-stripe write-penalty has to introduce write buffer to fast absorb and merge incoming writes, and then flush them to RAID array in batch. However, we experimentally observe that the popular buffering mechanism as Linux RAID journal and partial parity logging (PPL) becomes a bottleneck for ultra-fast SSD-based RAID, and we further uncover that the centralized log-buffer model is the prime cause. In this paper, we propose a highly-parallel multi-Log RAID write buffer, Mlog, employing a two-dimensional log data-layout and a parallel I/O processing model to fully exploit both intra-/inter-SSDs parallelism. Specifically, Mlog partitions the global log-buffer into a set of SSD-zones located within each SSDs in horizontal, and the SSD-zone further divided into sublogs in vertical. Each sublog adopts a fine-grained submission lock. A set of sublogs with the same logical offset across different SSDs are combined into a LogGroup. Moreover, Mlog presents a two-phase write allocation to hash an incoming request to a LogGroup, and then strategically writes it to a dedicated sublog, thus providing highly-parallel logging writes. Mlog schedules part of LogGroups to serve incoming requests while reclaiming the others in the background, unleashing the internal-parallelism of SSDs. Finally, Mlog generates parities for the buffered user data within a LogGroup, enhancing data reliability. We evaluate Mlog with a variety of benchmarks and real-world traces. Mlog is shown to consistently outperform Linux RAID journal and PPL by up to 122 × in the write throughput under intensive workloads.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127459646","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi Resource Scheduling with Task Cloning in Heterogeneous Clusters","authors":"Huanle Xu, Yang Liu, W. Lau","doi":"10.1145/3545008.3545093","DOIUrl":"https://doi.org/10.1145/3545008.3545093","url":null,"abstract":"To mitigate the straggler effect, today’s systems and computing frameworks have adopted redundancy to launch extra copies for stragglers. Two limitations of the existing straggler-mitigation techniques, however, are that resource demand of tasks is only considered in the context of slots and, moreover, redundancy is seldom coordinated with job scheduling. To tackle these issues, in this paper, we present DollyMP, a job scheduler that addresses multi-resource scheduling with task cloning in heterogeneous clusters. DollyMP carefully combines SRPT (Shortest Remaining Processing Time) and SVF (Smallest Volume First) via knapsack optimization to schedule tasks with multi-resource demands and, in the meanwhile, dynamically launches task clones to yield a small job completion time. DollyMP is built on a strong mathematical foundation to guarantee near-optimal performance. The deployment of our Hadoop YARN prototype on a 30-node cluster demonstrates that DollyMP can reduce job response time by 50% under different cluster loads.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130446186","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Postmortem Computation of Pagerank on Temporal Graphs","authors":"M. Hossain, Erik Saule","doi":"10.1145/3545008.3545055","DOIUrl":"https://doi.org/10.1145/3545008.3545055","url":null,"abstract":"Temporal graphs capture changes in relational data over time and have been of increasing interest to data analysts. Most research focuses on streaming algorithms that incrementally update an analysis to account for the changes in the graph. However, one can also be interested in understanding the nature of changes in the graph over time. In such a case, they perform a postmortem analysis on different points in time where all the data known in advance We study in this paper a postmortem analysis of Pagerank over-time on graphs that are defined by temporal relational event databases. A relation between two entities at a particular point in time will form an edge between these two entities and that will remain in the graph for a fixed period of time. While one can reuse a streaming algorithm for that purpose, leveraging the availability of all the data from the beginning can be beneficial. Postmortem analysis enables encoding the temporal graph with a more efficient graph representation. Also, it provides an additional level of parallelism since one can not only parallelize within a particular timestamp but also across different timestamps. We will show that depending on the properties of the temporal data, either parallelization can be better, and in some cases, a combination of both approaches is preferable. We experimentally show across 7 databases and across different temporal derivations of the graph that postmortem analysis can be between 50 times and 880 times faster than streaming analysis.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129455021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Automatically Generating High-performance Matrix Multiplication Kernels on the Latest Sunway Processor","authors":"Xiaohan Tao, Yuming Zhu, Bo-Hsuan Wang, Jinlong Xu, Jianmin Pang, Jie Zhao","doi":"10.1145/3545008.3545031","DOIUrl":"https://doi.org/10.1145/3545008.3545031","url":null,"abstract":"We present an approach to the automatic generation of efficient matrix multiplication code on the latest Sunway processor, which will be employed by the next-generation machine of Sunway TaihuLight, one of the fastest supercomputers on earth. The method allows users to write simple C code and automatically generates high-performance matrix multiplication kernels. It uses polyhedral transformations to implement rapid compute decomposition, data exchanges across memory hierarchy and memory latency hiding. An assembly routine is finally integrated into the generated kernels. While achieving up to 90.14% of the theoretical peak performance, our method surpasses a highly tuned library by 9.44%. Compared with existing techniques, our approach reduces the software development life cycle to generate efficient matrix code from months to seconds. We also take into account batched matrix multiplication and some fusion patterns for deep learning (DL), outperforming the library-based implementations by 1.30 × and 1.67 ×.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117162745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Semi-Online Multi-Machine with Restart Scheduling for Integrated Edge and Cloud Computing Systems","authors":"Liming Ge, Zizhao Wang, Wei Bao, Dong Yuan, N. H. Tran, B. Zhou, Albert Y. Zomaya","doi":"10.1145/3545008.3545059","DOIUrl":"https://doi.org/10.1145/3545008.3545059","url":null,"abstract":"We study the multi-machine task scheduling problem in an integrated serverless edge and cloud computing system, where tasks can be scheduled locally on edge processors or offloaded to cloud servers, with the objective of minimizing the makespan, i.e., the total time to finish all tasks. The system is semi-online, where the edge processing delays of the tasks are known as priori, but the cloud processing delays remain unknown due to the uncertainty introduced by uploading and loading delay (loading the software environment). The problem is NP-hard in nature, and therefore we resort to approximation schemes and propose a novel algorithm named multi-machine with restart scheduling (MRS). MRS utilizes task restart, where a task that is cancelled will be restarted later when its processing time exceeds the threshold, and the threshold can be adaptively adjusted. We derive an competitive ratio for MRS so that its worst-case gap from the optimal solution is bounded. We also implement the MRS scheduler in a real-world system, which schedules a diverse set of Deep Neural Network (DNN) inference tasks. It shows that MRS achieves significant reduction in makespan compared to existing benchmark schemes.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"70 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124563428","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tensor-Accelerated Fourth-Order Epistasis Detection on GPUs","authors":"Ricardo Nobre, A. Ilic, Sergio Santander-Jiménez, Leonel Sousa","doi":"10.1145/3545008.3545066","DOIUrl":"https://doi.org/10.1145/3545008.3545066","url":null,"abstract":"The improved accessibility of gene sequencing technologies has led to creation of huge datasets, i.e. patient records related to certain human diseases (phenotypes). Hence, deriving fast and accurate algorithms for efficiently processing these datasets is a paramount concern to enable some key healthcare scenarios, such as personalizing treatments, explaining the occurrence of and/or susceptibility to complex conditions and reducing the spread of infectious diseases. This is especially true for high-order epistasis detection, one of the most computationally challenging problems in bioinformatics, where associations between a given phenotype and single nucleotide polymorphisms (SNPs) of a population can often only be uncovered through evaluation of a large number of SNP combinations. To tackle this challenge, we propose a novel fourth-order epistasis detection algorithm that leverages tensor processing capabilities of two distinct accelerator architectures by efficiently mapping core computations related to processing quads of SNPs to binary tensor-accelerated matrix operations. Experimental results show that the proposed approach delivers very high performance even in single-GPU environments, e.g., 27.8 and 90.9 tera quads of SNPs per second, scaled to the sample size, were processed on Titan RTX (Turing) and A100 (Ampere) PCIe GPUs, respectively. Being the first approach that exploits tensor cores for accelerating searches with interaction order above three, the proposed method achieved a performance of up to 835.4 tera quads of SNPs per second on the 8-GPU HGX A100 server, which represents performance two or more orders of magnitude higher than that of related art.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"184 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116061943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Efficient Phase-Functioned Real-time Character Control in Mobile Games: A TVM Enabled Approach","authors":"Haidong Lan, Wenxi Zhu, Du Wu, Qian Qiu, Honglin Zhu, Jingjing Zhao, Xinghui Fu, Liu Wei, Jintao Meng, Minwen Deng","doi":"10.1145/3545008.3545095","DOIUrl":"https://doi.org/10.1145/3545008.3545095","url":null,"abstract":"In this paper, we propose a highly efficient computing method for game character control with phase-functioned neural networks (PFNN). The primary challenge to accelerate PFNN on mobile platforms is that PFNN dynamically produces weight matrices with an argument, phase, which is individual to each game character. Therefore existing libraries that generally assume frozen weight matrices are inefficient to accelerate PFNN. The situation becomes even worse when multiple characters are present. To address the challenges, we reformulate the equations and leverage the deep learning compiler stack TVM to build a cross-platform, high-performance implementation. Evaluations reveal that our solutions deliver close-to-peak performance on various platforms, from high-performance servers to energy-efficient mobile platforms. This work is publicly available at https://github.com/turbo0628/pfnn_tvm.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129838996","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Spread: Decentralized Model Aggregation for Scalable Federated Learning","authors":"Chuang Hu, Huang Huang Liang, Xiao Han, Bo Liu, D. Cheng, Dan Wang","doi":"10.1145/3545008.3545030","DOIUrl":"https://doi.org/10.1145/3545008.3545030","url":null,"abstract":"Federated learning (FL) is a new distributed machine learning paradigm that enables machine learning on edge devices. One unique feature of FL is that edge devices belong to individuals; and since they are not “owned” by the FL coordinator, but can be “federated” instead, there can potentially be a huge number of edge devices. In the current distributed ML architecture, the parameter server (PS) architecture, model aggregation is centralized. When facing a large number of edge devices, the centralized model aggregation becomes the bottleneck and fundamentally restricts system scalability. In this paper, we present Spread to decentralize model aggregation. Spread is a tiered architecture where nodes are organized into clusters so that model aggregation can be offloaded to certain edge devices. We design a Spread-based FL system: it employs a new algorithm for cluster construction and an adaptive algorithm that regulates, in runtime, inter-cluster model training and intra-cluster model training. We present an implementation of a functional system by extending the Federated Learning system. Our evaluation shows that Spread can resolve the bottleneck of centralized model aggregation. Spread yields an 8.05 × and a 5.58 × model training speedup as compared to existing FL systems supported by the PS and allReduce architecture.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"159 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116153046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Online Scheduling of Moldable Task Graphs under Common Speedup Models","authors":"A. Benoit, Lucas Perotin, Y. Robert, Hongyang Sun","doi":"10.1145/3545008.3545049","DOIUrl":"https://doi.org/10.1145/3545008.3545049","url":null,"abstract":"The problem of scheduling moldable tasks on multiprocessor systems with the objective of minimizing the overall completion time (or makespan) has been widely studied, in particular when tasks have dependencies (i.e., task graphs), or when tasks are released on-the-fly (i.e., online). However, few studies have focused on both (i.e., online scheduling of moldable task graphs). In this paper, we design a new online algorithm and derive constant competitive ratios for this problem under several common yet realistic speedup models (i.e., roofline, communication, Amdahl, and a general combination). We also prove, for each model, a lower bound on the competitiveness of our algorithm, which is very close to the constant competitive ratio. Finally, we provide the first lower bound on the competitive ratio of any deterministic online algorithm for the arbitrary speedup model, which is not constant but depends on the number of tasks in the longest path of the graph.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"146 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116446602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Highly Parallel Linear Forest Extraction from a Weighted Graph on GPUs","authors":"Christopher J. Klein, R. Strzodka","doi":"10.1145/3545008.3545035","DOIUrl":"https://doi.org/10.1145/3545008.3545035","url":null,"abstract":"For graph matching, each vertex is allowed to match with exactly one other vertex, such that the spanning subgraph of the matching has a maximum degree of one, i.e., the subgraph is a [0,1]-factor. In this work, we provide a highly parallel algorithm to extract a spanning subgraph with a maximum degree of n (the subgraph is a [0,n]-factor) and demonstrate the efficiency of our GPU implementation for n=1,2,3,4 by expressing the algorithm in terms of generalized sparse matrix-vector products. Moreover, from the [0,2]-factor, we compute a maximum linear forest (union of disjoint paths) by breaking up cycles and permuting the subgraph with respect to the vertex order within the paths. Both tasks execute efficiently on the GPU because of our novel parallel scan implementation, which does not require a random access iterator. As an application of linear forests, we demonstrate the algebraic creation of enhanced tridiagonal preconditioners for various large matrices from the Sparse Matrix Collection and report runtimes in the order of milliseconds for graphs with millions of edges and vertices on an RTX 2080 Ti.","PeriodicalId":360504,"journal":{"name":"Proceedings of the 51st International Conference on Parallel Processing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123904826","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}