{"title":"EC-Shuffle: Dynamic Erasure Coding Optimization for Efficient and Reliable Shuffle in Spark","authors":"Xin Yao, Cho-Li Wang, Mingzhe Zhang","doi":"10.1109/CCGRID.2019.00014","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00014","url":null,"abstract":"Fault-tolerance capabilities attract increasing attention from existing data processing frameworks, such as Apache Spark. To avoid replaying costly distributed computation, like shuffle, local checkpoint and remote replication are two popular approaches. They incur significant runtime overhead, such as extra storage cost or network traffic. Erasure coding is another emerging technology, which also enables data resilience. It is perceived as capable of replacing the checkpoint and replication mechanisms for its high storage efficiency. However, it suffers heavy network traffic due to distributing data partitions to different locations. In this paper, we propose EC-Shuffle with two encoding schemes and optimize the shuffle-based operations in Spark or MapReduce-like frameworks. Specifically, our encoding schemes concentrate on optimizing the data traffic during the execution of shuffle operations. They only transfer the parity chunks generated via erasure coding, instead of a whole copy of all data chunks. EC-Shuffle also provides a strategy, which can dynamically select the per-shuffle biased encoding scheme according to the number of senders and receivers in each shuffle. Our analyses indicate that this dynamic encoding selection can minimize the total size of parity chunks. The extensive experimental results using BigDataBench with hundreds of mappers and reducers shows this optimization can reduce up to 50% network traffic and achieve up to 38% performance improvement.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129306112","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exhaustive Study of Hierarchical AllReduce Patterns for Large Messages Between GPUs","authors":"Yuichiro Ueno, Rio Yokota","doi":"10.1109/CCGRID.2019.00057","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00057","url":null,"abstract":"Data-parallel distributed deep learning requires an AllReduce operation between all GPUs with message sizes in the order of hundreds of megabytes. The popular implementation of AllReduce for deep learning is the Ring-AllReduce, but this method suffers from latency when using thousands of GPUs. There have been efforts to reduce this latency by combining the ring with more latency-optimal hierarchical methods. In the present work, we consider these hierarchical communication methods as a general hierarchical Ring-AllReduce with a pure Ring-AllReduce on one end and Rabenseifner's algorithm on the other end of the spectrum. We exhaustively test the various combinations of hierarchical partitioning of processes on the ABCI system in Japan on up to 2048 GPUs. We develop a performance model for this generalized hierarchical Ring-AllReduce and show the lower-bound of the effective bandwidth achievable for the hierarchical NCCL communication on thousands of GPUs. Our measurements agree well with our performance model. We also find that the optimal large-scale process hierarchy contains the optimal small-scale process hierarchy so the search space for the optimal communication will be reduced.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132430250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Preliminary Fault Taxonomy for Multi-tenant SaaS Systems","authors":"V. H. S. C. Pinto, S. Souza, P. L. D. Souza","doi":"10.1109/CCGRID.2019.00032","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00032","url":null,"abstract":"Multi-tenancy is the key feature for every Software as a Service (SaaS), as it enables multiple customers, so-called tenants, to transparently share a system's resources reducing costs. Tenants can customize a system according to their particular needs, however, such a high level of complexity may open possibilities for a failure. In addition, there is a lack of a reference architecture for such applications and once the implementations differ significantly, ensuring that all executions flows have been verified without impacting the working features for other tenants is a complex task. The clear understanding of the possible faults is fundamental for the identification, tolerance and definition of appropriate testing techniques. This paper presents a preliminary fault taxonomy for multi-tenant cloud applications considering their foundational features. A literature review previously carried out, a survey with practitioners and analysis of some applications were performed to achieve this classification. In addition, an e-commerce called MtShop was developed for a case study. The expressiveness of the proposed taxonomy is illustrated with critical faults identified in the MtShop through the automated and parallel testing. We conclude with the benefits that our taxonomy can bring to testing, prediction and regression testing activity of multi-tenant cloud applications.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130772008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Performance Improvement Approach for Second-Order Optimization in Large Mini-batch Training","authors":"Hiroki Naganuma, Rio Yokota","doi":"10.1109/CCGRID.2019.00092","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00092","url":null,"abstract":"Classical learning theory states that when the number of parameters of the model is too large compared to the data, the model will overfit and the generalization performance deteriorates. However, it has been empirically shown that deep neural networks (DNN) can achieve high generalization capability by training with extremely large amount of data and model parameters, which exceeds the predictions of classical learning theory. One drawback of this is that training of DNN requires enormous calculation time. Therefore, it is necessary to reduce the training time through large scale parallelization. Straightforward data-parallelization of DNN degrades convergence and generalization. In the present work, we investigate the possibility of using second order methods to solve this generalization gap in large-batch training. This is motivated by our observation that each mini-batch becomes more statistically stable, and thus the effect of considering the curvature plays a more important role in large-batch training. We have also found that naively adapting the natural gradient method causes the generalization performance to deteriorate further due to the lack of regularization capability. We propose an improved second order method by smoothing the loss function, which allows second-order methods to generalize as well as mini-batch SGD.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124273819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Zoom-in Analysis of I/O Logs to Detect Root Causes of I/O Performance Bottlenecks","authors":"Teng Wang, S. Byna, Glenn K. Lockwood, S. Snyder, P. Carns, Sunggon Kim, N. Wright","doi":"10.1109/CCGRID.2019.00021","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00021","url":null,"abstract":"Scientific applications frequently spend a large fraction of their execution time in reading and writing data on parallel file systems. Identifying these I/O performance bottlenecks and attributing root causes are critical steps toward devising optimization strategies. Several existing studies analyze I/O logs of a set of benchmarks or applications that were run with controlled behaviors. However, there is still a lack of general approach that systematically identifies I/O performance bottlenecks for applications running \"in the wild\" on production systems. In this study, we have developed an analysis approach of \"zooming in\" from platform-wide to application-wide to job-level I/O logs for identifying I/O bottlenecks in arbitrary scientific applications. We analyze the logs collected on a Cray XC40 system in production over a two-month period. This study results in several insights for application developers to use in optimizing I/O behavior.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129135340","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"CRAM: a Container Resource Allocation Mechanism for Big Data Streaming Applications","authors":"Olubisi Runsewe, N. Samaan","doi":"10.1109/CCGRID.2019.00045","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00045","url":null,"abstract":"Containerization provides a lightweight alternative to the use of virtual machines for potentially reducing service cost and improving cloud resource utilization. A key challenge is how to allocate container resources to multiple competing streaming applications with varying QoS demands running on a heterogeneous cluster of hosts. In this paper, we focus on workload distribution for optimal resource allocation to meet the real-time demands of competing containerized big data streaming applications. We propose a container resource allocation mechanism (CRAM) based on game theory and formulate the problem as an n-player non-cooperative game among a set of heterogeneous containerized streaming applications. From our analysis, we obtain the optimal Nash Equilibrium state where no player can further improve its performance without impairing others. Experimental results demonstrate the effectiveness of our approach, which attempts to equally satisfy each containerized streaming application's request as compared to existing techniques that may treat some applications unfairly.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133371215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Enabling Dynamic Resource Estimation and Correction for Improving Utilization in an Apache Mesos Cloud Environment","authors":"Gourav Rattihalli, M. Govindaraju, Devesh Tiwari","doi":"10.1109/CCGRID.2019.00033","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00033","url":null,"abstract":"Academic cloud infrastructures require users to specify an estimate of their resource requirements. The resource usage for applications often depends on the input file sizes, parameters, optimization flags, and attributes, specified for each run. Incorrect estimation can result in low resource utilization of the entire infrastructure and long wait times for jobs in the queue. We have designed a Resource Utilization based Migration (RUMIG) system to address the resource estimation problem. We present the overall architecture of the two-stage elastic cluster design, the Apache Mesos-specific container migration system, and analyze the performance for several scientific workloads on three different cloud/cluster environments. In this paper we (b) present a design and implementation for container migration in a Mesos environment, (c) evaluate the effect of right-sizing and cluster elasticity on overall performance, (d) analyze different profiling intervals to determine the best fit, (e) determine the overhead of our profiling mechanism. Compared to the default use of Apache Mesos, in the best cases, RUMIG provides a gain of 65% in runtime (local cluster), 51% in CPU utilization in the Chameleon cloud, and 27% in memory utilization in the Jetstream cloud.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124997795","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effective and Efficient Big Data Management in Distributed Environments: Models, Issues, and Research Perspectives","authors":"A. Cuzzocrea","doi":"10.1109/CCGRID.2019.00071","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00071","url":null,"abstract":"This paper focuses the attention on the emerging problem of effectively and efficiently managing big data in distributed environments, by also proposing research perspectives along this line of research. Most relevant state-of-the-art approaches are also reported and discussed. Finally, the paper proposes the logical architecture of a distributed system that considers the interesting case of sensor data, a significant instance of big data.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"222 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123304349","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Mobile Smart-Contract Lifecycle Governance with Incentivized Proof-of-Stake for Oligopoly-Formation Prevention","authors":"Vipin Deval, A. Norta","doi":"10.1109/CCGRID.2019.00029","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00029","url":null,"abstract":"Permissionless blockchain-enabled smart contracts execute code in a distributed peer-to-peer network system and thereby overcome undesirable effects of system centralization. Smart contracts that use proof-of-stake (PoS) algorithms for the validation of transactions have advantages over proof-of-work (PoW) in that they use less electricity and perform faster. The disadvantage of PoS algorithms is the issue of nothing to stake and the emergence of staking oligopolies. Thus, significant stakeholders might be able to create an oligopoly as miners with significant stakes have the chance to validate the transaction in a dominant position. In current smart contracts, the adoption of mobile devices is another emerging trend to manage mobile smart contracts. The advantage is spreading of a democratization effect as a large number of stakers participate in transaction validation and thereby reduce the risk of oligopolies. In our work, we aim to improve the PoS algorithm to reduce oligopoly formation in smart contracts by addressing the need for creating mobile smart contracts that are governed by a mobile lifecycle management. Additionally, we enhance the scalability and performance of smart contracts by focusing specifically on ways to incentivize PoS algorithms.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128341054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Autotuning Under Tight Budget Constraints: A Transparent Design of Experiments Approach","authors":"P. Bruel, S. Masnada, B. Videau, Arnaud Legrand, J. Vincent, A. Goldman","doi":"10.1109/CCGRID.2019.00026","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00026","url":null,"abstract":"A large amount of resources is spent writing, porting, and optimizing scientific and industrial High Performance Computing applications, which makes autotuning techniques fundamental to lower the cost of leveraging the improvements on execution time and power consumption provided by the latest software and hardware platforms. Despite the need for economy, most autotuning techniques still require large budgets of costly experimental measurements to provide good results, while rarely providing exploitable knowledge after optimization. The contribution of this paper is a user-transparent autotuning technique based on Design of Experiments that operates under tight budget constraints by significantly reducing the measurements needed to find good optimizations. Our approach enables users to make informed decisions on which optimizations to pursue and when to stop. We present an experimental evaluation of our approach and show it is capable of leveraging user decisions to find the best global configuration of a GPU Laplacian kernel using half of the measurement budget used by other common autotuning techniques. We show that our approach is also capable of finding speedups of up to 50x, compared to gcc's -O3, for some kernels from the SPAPT benchmark suite, using up to 10x fewer measurements than random sampling.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129173661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}