{"title":"Detection of Silent Data Corruptions in Smoothed Particle Hydrodynamics Simulations","authors":"Aurélien Cavelan, R. Cabezón, F. Ciorba","doi":"10.1109/CCGRID.2019.00013","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00013","url":null,"abstract":"Silent data corruptions (SDCs) hinder the correctness of long-running scientific applications on large-scale computing systems. Selective particle replication (SPR) is proposed herein as the first particle-based replication method for detecting SDCs in smoothed particle hydrodynamics (SPH) simulations. SPH is a mesh-free Lagrangian method commonly used to perform hydrodynamical simulations in astrophysics and computational fluid dynamics. SPH performs interpolation of physical properties over neighboring discretization points (called SPH particles) that dynamically adapt their distribution to the mass density field of the fluid. When a fault (e.g., a bit-flip) strikes the computation or the data associated with a particle, the resulting error is silently propagated to all nearest neighbors through such interpolation steps. SPR replicates the computation and data of a few carefully selected SPH particles. SDCs are detected when the data of a particle differs, due to corruption, from its replicated counterpart. SPR is able to detect many DRAM SDCs as they propagate by ensuring that all particles have at least one neighbor that is replicated. The detection capabilities of SPR were assessed through a set of error-injection and detection experiments, and the overhead of SPR was evaluated via a set of strong-scaling experiments conducted on an HPC system. The results show that SPR achieves detection rates of 91-99.9% with no false positives, at an overhead of 1-10%.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128281510","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Evaluation of Transfer Learning for Classifying Sales Engagement Emails at Large Scale","authors":"Yong Liu, Pavel A. Dmitriev, Yifei Huang, Andrew Brooks, Li Dong","doi":"10.1109/CCGRID.2019.00069","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00069","url":null,"abstract":"This paper conducts an empirical investigation to evaluate transfer learning for classifying sales engagement emails arising from digital sales engagement platforms. Given the complexity of the content and context of sales engagement, the lack of standardized large corpora and benchmarks, limited labeled examples, and the heterogeneous context of intent, this real-world use case poses both a challenge and an opportunity for adopting a transfer learning approach. We propose an evaluation framework to assess a high-performance transfer learning (HPTL) approach in three key areas in addition to commonly used accuracy metrics: 1) effective embeddings and pretrained language model usage, 2) minimum labeled-sample requirements, and 3) transfer learning implementation strategies. We use in-house sales engagement email samples as the experiment dataset, which includes over 3,000 emails labeled as positive, objection, unsubscribe, or not-sure. We discuss our findings on evaluating BERT, ELMo, Flair, and GloVe embeddings with both feature-based and fine-tuning approaches, and their scalability on a GPU cluster with increasingly large labeled sample sets. Our results show that fine-tuning the BERT model outperforms all the feature-based approaches using different embeddings with as few as 300 labeled samples, but underperforms when fewer than 300 labeled samples are available.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"111 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133433871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Same Same, but Different: A Descriptive Intra-IaaS Differentiation","authors":"Yehia El-khatib, F. Samreen, G. Blair","doi":"10.1109/CCGRID.2019.00089","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00089","url":null,"abstract":"Users of cloud computing are overwhelmed with choice, even within the services offered by a single provider. As such, many users select cloud services based on description alone. In this quantitative study, we investigate the services of two major IaaS providers. We use two representative applications to obtain longitudinal observations over the 7 days of the week and over different times of the day, totalling over 14,000 executions. We give evidence of significant variations in the performance offered within IaaS services, calling for data-driven brokers that are able to offer automated and adaptive decision-making processes with means for incorporating expressive user constraints.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115618059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Framework for SLO-driven Cloud Specification and Brokerage","authors":"Abdessalam Elhabbash, Yehia El-khatib, G. Blair, Yuhui Lin, A. Barker","doi":"10.1109/CCGRID.2019.00085","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00085","url":null,"abstract":"The diversity of cloud offerings has motivated the proposal of cloud modelling languages (CMLs) to abstract the complexities related to the selection of cloud services. However, current CMLs lack support for modelling the service level objectives (SLOs) required by customer applications. Consequently, we propose an application- and provider-independent SLO modelling language (SLO-ML) to enable customers to specify the required SLOs. We also sketch the architecture needed to realise SLO-ML.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128279740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Batched Sparse Matrix Multiplication for Accelerating Graph Convolutional Networks","authors":"Yusuke Nagasaka, Akira Nukada, Ryosuke Kojima, S. Matsuoka","doi":"10.1109/CCGRID.2019.00037","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00037","url":null,"abstract":"Graph Convolutional Networks (GCNs) have recently attracted much attention in bioinformatics and chemoinformatics as a state-of-the-art machine learning approach with high accuracy. GCNs perform convolutional operations along graph structures, and GPUs are used to process the enormous number of operations involved, including sparse-dense matrix multiplication (SpMM) when the graph structure is expressed as an adjacency matrix in a sparse matrix format. However, the SpMM operation on small graphs, where the number of nodes is in the tens or hundreds, hardly exploits the high parallelism or compute power of a GPU. Therefore, SpMM becomes a bottleneck of training and inference in GCN applications. To improve the performance of GCN applications, we propose a new SpMM algorithm designed for small sparse matrices, and Batched SpMM, which exploits the high parallelism of the GPU by processing multiple SpMM operations with a single CUDA kernel. To the best of our knowledge, this is the first batched approach for SpMM. We evaluated the performance of the GCN application on TSUBAME3.0, equipped with NVIDIA Tesla P100 GPUs, and our batched approach shows significant speedups of up to 1.59x and 1.37x in training and inference, respectively.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-03-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123795895","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adapting the Secretary Hiring Problem for Optimal Hot-Cold Tier Placement Under Top-K Workloads","authors":"Ben Blamey, Fredrik Wrede, Johan Karlsson, A. Hellander, S. Toor","doi":"10.1109/CCGRID.2019.00074","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00074","url":null,"abstract":"Top-K queries are an established heuristic in information retrieval. This paper presents an approach for optimal tiered storage allocation under stream-processing workloads that use this heuristic: those requiring the analysis of only the top-K most relevant documents from a fixed-length stream, stream window, or batch job. Documents are ranked for relevance by a user-specified interestingness function, and the top-K are stored for further processing. This scenario bears similarity to the classic Secretary Hiring Problem (SHP), and the expected rate of document writes and document lifetime can be modelled as a function of document index. We present parameter-based algorithms for storage-tier placement, minimizing document storage and transport costs. We derive expressions for optimal parameter values in terms of tier storage and transport costs a priori, without needing to monitor the application. This contrasts with (often complex) existing work on tiered storage optimization, which is either tightly coupled to specific use cases or requires active monitoring of application IO load, making it ill-suited to the long-running or one-off operations common in the scientific computing domain. We motivate and evaluate our model with a trace-driven simulation of human-in-the-loop bio-chemical model exploration, and two cloud storage case studies.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128019281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Evaluation of Big Data Processing Strategies for Neuroimaging","authors":"Valérie Hayot-Sasson, Shawn T. Brown, T. Glatard","doi":"10.1109/CCGRID.2019.00059","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00059","url":null,"abstract":"Neuroimaging datasets are rapidly growing in size as a result of advancements in image acquisition methods, open science, and data sharing. However, the adoption of Big Data processing strategies by neuroimaging processing engines remains limited. Here, we evaluate three Big Data processing strategies (in-memory computing, data locality, and lazy evaluation) on typical neuroimaging use cases, represented by the BigBrain dataset. We contrast these strategies using Apache Spark and Nipype as our representative Big Data and neuroimaging processing engines, on Dell EMC's Top-500 cluster. Big Data thresholds were modeled by comparing the data-write rate of the application to the filesystem bandwidth and number of concurrent processes. This model acknowledges the fact that page caching provided by the Linux kernel is critical to the performance of Big Data applications. Results show that in-memory computing alone speeds up executions by a factor of up to 1.6, whereas when combined with data locality, this factor reaches 5.3. Lazy evaluation strategies were found to increase the likelihood of cache hits, further improving processing time. Such large speed-ups are likely to be observed for typical image processing operations performed on images larger than 75GB. A ballpark estimate from our model suggests that in-memory computing alone will not speed up current functional MRI analyses unless coupled with data locality and the concurrent processing of around 280 subjects. Furthermore, we observe that emulating in-memory computing using in-memory file systems (tmpfs) does not reach the performance of an in-memory engine, presumably due to swapping to disk and the lack of data cleanup. We conclude that Big Data processing strategies are worth developing for neuroimaging applications.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"75 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133403602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Distributed DNN Training using TensorFlow and CUDA-Aware MPI: Characterization, Designs, and Performance Evaluation","authors":"A. Awan, J. Bédorf, Ching-Hsiang Chu, H. Subramoni, D. Panda","doi":"10.1109/CCGRID.2019.00064","DOIUrl":"https://doi.org/10.1109/CCGRID.2019.00064","url":null,"abstract":"The current wave of advances in Deep Learning (DL) has been triggered by the availability of large-scale datasets, efficient CPU and GPU hardware, and the development of software frameworks like TensorFlow (TF). However, little exists in the literature that addresses TensorFlow's distributed training capabilities. In this paper, we provide an in-depth performance characterization and design analysis for distributed TensorFlow. We present three key insights: 1) Horovod designs achieve better performance compared to the official gRPC-based approaches, 2) the performance of the Horovod design is heavily influenced by the time spent in gradient aggregation using the Allreduce primitive, and 3) the performance of the existing Horovod-MPI implementation is significantly worse compared to Horovod-NCCL. To address this limitation in Horovod-MPI, we propose a novel and efficient CUDA-Aware MPI Allreduce design that 1) exploits CUDA kernels to perform large reductions on the GPU, 2) uses a combination of bandwidth-optimal and latency-optimal algorithms, and 3) maintains a pointer cache to avoid CUDA-driver query overheads in the critical path. The proposed designs deliver 5×, 17×, and 29% better performance compared to NCCL2 for small, medium, and large messages, respectively. Our designs enable Horovod-MPI to beat the state-of-the-art Horovod-NCCL2 by 3% and achieve 90% scaling efficiency for ResNet-50 training on 64 Pascal GPUs.","PeriodicalId":234571,"journal":{"name":"2019 19th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-10-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114331337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}