{"title":"Workflow Simulation Aware and Multi-threading Effective Task Scheduling for Heterogeneous Computing","authors":"Vasilios I. Kelefouras, K. Djemame","doi":"10.1109/HiPC.2018.00032","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00032","url":null,"abstract":"Efficient application scheduling is critical for achieving high performance in heterogeneous computing systems. This problem has proved to be NP-complete, heading research efforts in obtaining low complexity heuristics that produce good quality schedules. Although this problem has been extensively studied in the past, all the related works assume the computation costs of application tasks on processors are available a priori, ignoring the fact that the time needed to run/simulate all these tasks is orders of magnitude higher than finding a good quality schedule, especially in heterogeneous systems. In this paper, we propose two new methods applicable to several task scheduling algorithms for heterogeneous computing systems. We showcase both methods by using HEFT well known and popular algorithm, but they are applicable to other algorithms too, such as HCPT, HPS, PETS and CPOP. First, we propose a methodology to reduce the scheduling time of HEFT when the computation costs are unknown, without sacrificing the length of the output schedule (monotonic computation costs); this is achieved by reducing the number of computation costs required by HEFT and as a consequence the number of simulations applied. Second, we give heuristics to find which tasks are going to be executed as Single-Thread and which as Multi-Thread CPU implementations, as well as the number of the threads used. The experimental results considering both random graphs and real world applications show that extending HEFT with the two proposed methods achieves better schedule lengths, while at the same time requires from 4.5 up to 24 less simulations.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126476020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decentralized Privacy-Preserving Timed Execution in Blockchain-Based Smart Contract Platforms","authors":"Chao Li, Balaji Palanisamy","doi":"10.1109/HiPC.2018.00037","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00037","url":null,"abstract":"Timed transaction execution is critical for various decentralized privacy-preserving applications powered by blockchain-based smart contract platforms. Such privacy-preserving smart contract applications need to be able to securely maintain users' sensitive inputs off the blockchain until a prescribed execution time and then automatically make the inputs available to enable on-chain execution of the target function at the execution time, even if the user goes offline. While straight-forward centralized approaches provide a basic solution to the problem, unfortunately they are limited to a single point of trust. This paper presents a new decentralized privacy-preserving transaction scheduling approach that allows users of Ethereum-based decentralized applications to schedule transactions without revealing sensitive inputs before an execution time window selected by the users. The proposed approach involves no centralized party and allows users to go offline at their discretion after scheduling a transaction. The sensitive inputs are privately maintained by a set of trustees randomly selected from the network enabling the inputs to be revealed only at the execution time. The proposed protocol employs secret key sharing and layered encryption techniques and economic deterrence models to securely protect the sensitive information against possible attacks including some trustees destroying the sensitive information or secretly releasing the sensitive information prior to the execution time. We demonstrate the attack-resilience of the proposed approach through rigorous analysis. Our implementation and experimental evaluation on the Ethereum official test network demonstrates that the proposed approach is effective and has a low gas cost and time overhead associated with it.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114674071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sampled Dense Matrix Multiplication for High-Performance Machine Learning","authors":"Israt Nisa, Aravind Sukumaran-Rajam, Süreyya Emre Kurt, Changwan Hong, P. Sadayappan","doi":"10.1109/HiPC.2018.00013","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00013","url":null,"abstract":"Many machine learning methods involve iterative optimization and are amenable to a variety of alternate formulations. Many currently popular formulations for some machine learning methods based on core operations that essentially correspond to sparse matrix-vector products. A reformulation using sparse matrix-matrix products primitives can potentially enable significant performance enhancement. Sampled Dense-Dense Matrix Multiplication (SDDMM) is a primitive that has been shown to be usable as a core component in reformulations of many machine learning factor analysis algorithms such as Alternating Least Squares (ALS), Latent Dirichlet Allocation (LDA), Sparse Factor Analysis (SFA), and Gamma Poisson (GaP). It requires the computation of the product of two input dense matrices but only at locations of the result matrix corresponding to nonzero entries in a sparse third input matrix. In this paper, we address the development of cuSDDMM, a multi-node GPU-accelerated implementation for SDDMM. We analyze the data reuse characteristics of SDDMM and develop a model-driven strategy for choice of tiling permutation and tile-size choice. cuSDDMM improves significantly (up to 4.6x) over the best currently available GPU implementation of SDDMM (in the BIDMach Machine Learning library).","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125261044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Code and Data Transformations to Address Garbage Collector Performance in Big Data Processing","authors":"Damon Fenacci, H. Vandierendonck, Dimitrios S. Nikolopoulos","doi":"10.1109/HiPC.2018.00040","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00040","url":null,"abstract":"Java, with its dynamic runtime environment and garbage collected GC memory management, is a very popular choice for big data processing engines. Its runtime provides convenient mechanisms to implement workload distribution without having to worry about direct memory allocation and deallocation. However, efficient memory usage is a recurring issue. In particular, our evaluation shows that garbage collection has huge drawbacks when handling a large number of data objects and more than 60% of execution time can be consumed by garbage collection. We present a set of unconventional strategies to counter this issue that rely on data layout transformations to drastically reduce the number of objects, and on changing the code structure to reduce the lifetime of objects. We encapsulate the implementation in Apache Spark making it transparent for software developers. Our preliminary results show an average speedup of 1.54 and a highest of 8.23 over a range of applications, datasets and GC types. In practice, this can provide a substantial reduction in execution time or allow a sizeable reduction in the amount of compute power needed for the same task.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"241 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115587534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balancing Stragglers Against Staleness in Distributed Deep Learning","authors":"Saurav Basu, Vaibhav Saxena, Rintu Panja, Ashish Verma","doi":"10.1109/HiPC.2018.00011","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00011","url":null,"abstract":"Synchronous SGD is frequently the algorithm of choice for training deep learning models on compute clusters within reasonable time frames. However, even if a large number of workers (CPUs or GPUs) are at disposal for training, hetero-geneity of compute nodes and unreliability of the interconnecting network frequently pose a bottleneck to the training speed. Since the workers have to wait for each other at every model update step, even a single straggler/slow worker can derail the whole training performance. In this paper, we propose a novel approach to mitigate the straggler problem in large compute clusters. We cluster the compute nodes into multiple groups where each group updates the model synchronously stored in its own parameter server. The parameter servers of the different groups update the model in a central parameter server in an asynchronous manner. Few stragglers in the same group (or even separate groups) have little effect on the computational performance. The staleness of the asynchronous updates can be controlled by limiting the number of groups. Our method, in essence, provides a mechanism to move seamlessly between a pure synchronous and a pure asynchronous setting, thereby balancing between the computational overhead of synchronous SGD and the accuracy degradation of a pure asynchronous SGD. We empirically show that with increasing delay from straggler nodes (more than 300% delay in a node), progressive grouping of available workers still finishes the training within 20% of the no-delay case, with the limit to the number of groups governed by the permissible degradation in accuracy (≤ 2.5% compared to the no-delay case).","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124514961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Runtime Features for Distributed Graph Algorithms","authors":"J. Firoz, Marcin Zalewski, Joshua D. Suetterlein, A. Lumsdaine","doi":"10.1109/HiPC.2018.00018","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00018","url":null,"abstract":"Performance of distributed graph algorithms can benefit greatly by forming rapport between algorithmic abstraction and the underlying runtime system that is responsible for scheduling work and exchanging messages. However, due to their dynamic and irregular nature of computation, distributed graph algorithms written in different programming models impose varying degrees of workload pressure on the runtime. To cope with such vastly different workload characteristics, a runtime has to make several trade-offs. One such trade-off arises, for example, when the runtime scheduler has to choose among alternatives such as whether to execute algorithmic work, or progress the network by probing network buffers, or throttle sending messages (termed flow control). This trade-off decides between optimizing the throughput of a runtime scheduler by increasing the rate of execution of algorithmic work, and reducing the latency of the network messages. Another trade-off exists when a decision has to be made about when to send aggregated messages in buffers (message coalescing). This decision chooses between trading off latency for network bandwidth and vice versa. At any instant, such trade-offs emphasize either on improving the quantity of work being executed (by maximizing the scheduler throughput) or on improving the quality of work (by prioritizing better work). However, encoding static policies for different runtime features (such as flow control, coalescing) can prevent graph algorithms from achieving their full potentials, thus can under-mine the actual performance of a distributed graph algorithm . In this paper, we investigate runtime support for distributed graph algorithms in the context of two paradigms: variants of well-known Bulk-Synchronous Parallel model and asynchronous programming model. We explore generic runtime features such as message coalescing (aggregation) and flow control and show that execution policies of these features need to be adjusted over time to make a positive impact on the execution time of a distributed graph algorithm. Since synchronous and asynchronous graph algorithms have different workload characteristics, not all of such runtime features may be good candidates for adaptation. Each of these algorithmic paradigms may require different set of features to be adapted over time. We demonstrate which set of feature(s) can be useful in each case to achieve the right balance of work in the runtime layer. Existing implementation of different graph algorithms can benefit from adapting dynamic policies in the underlying runtime.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133828206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Approach for Handling Soft Error in Conjugate Gradients","authors":"M. E. Ozturk, Marissa Renardy, Yukun Li, G. Agrawal, Ching-Shan Chou","doi":"10.1109/HiPC.2018.00030","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00030","url":null,"abstract":"Soft errors or bit flips have recently become an important challenge in high performance computing. In this paper, we focus on soft errors in a particular algorithm: conjugate gradients (CG). We present a series of techniques to detect soft errors in CG. We first derive a mathematical quantity that is monotonically decreasing. Next, we add a set of heuristics and combine our approach with previously established methods. We have extensively evaluated our method considering three distinct dimensions. First, we show that the F-score of our detection is significantly better than two other methods. Second, we show that for soft errors that are not detected by our method, the resulting inaccuracy in the final results are small, and better than those with other methods. Finally, we show that the runtime overheads of our method are lower than for other methods.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127484153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Performance Prediction Framework for Irregular Applications","authors":"Gangyi Zhu, G. Agrawal","doi":"10.1109/HiPC.2018.00042","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00042","url":null,"abstract":"Predicting performance of applications is an important requirement for many goals – choosing future procurements or upgrades, selecting specific optimization/implementation, requesting and allocating resources, and others. Irregular access patterns, commonly seen in many compute-intensive and data-intensive applications, pose many challenges in estimating overall execution time of applications, including, but not limit to, cache behavior. While much work exists on analysis of cache behavior with regular accesses, relatively little attention has been paid to irregular codes. In this paper, we aim to predict execution time of irregular applications on different hardware configurations, with emphasis on analyzing cache behavior with varying size of the cache and the number of nodes. Cache performance of irregular computations is highly input-dependent. Based on the sparse matrix view of irregular computation as well as the cache locality analysis, we propose a novel sampling approach named Adaptive Stratified Row sampling – this method is capable of generating a representative sample that delivers cache performance similar to the original input. On top of our sampling method, we incorporate reuse distance analysis to accommodate different cache configurations with high efficiency. Besides, we modify SKOPE, a code skeleton framework, to predict the execution time for irregular applications with the predicted cache performance. The results show that our approaches keep average error rates under 6% in predicting L1 cache miss rate for different cache configurations. The average error rates of predicting execution time for sequential and parallel scenarios are under 5% and 15%, respectively.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127672638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Proximity-Based Methods for Large-Scale Analysis of Atom Probe Data","authors":"Hao Lu, S. Seal, J. Poplawsky","doi":"10.1109/HiPC.2018.00034","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00034","url":null,"abstract":"Powered by recent advances in data acquisition technologies, today's state-of-the-art atom probe microscopes yield data sets with sizes ranging from a few million atoms to billions of atoms. Analysis of these atomic data sets within rea-sonable turnaround times is a pressing data analysis challenge for material scientists currently equipped with software systems that do not scale to these massive data sets. Here, we present the shared memory component of a larger ongoing effort to develop a multi-feature data analysis framework capable of analyzing atom probe data of all sizes and scales from desktop multicore machines to large-scale high-performance computing platforms with hybrid (shared and distributed memory) architectures. Our focus here is on a broad class of popular atom probe data analysis methods that rely on core time-consuming k-NN queries. We present a scalable, heuristic algorithm for k-NN queries using three-dimensional range trees. To demonstrate its efficacy, the k-NN algorithm is integrated with two use cases of atom probe data analysis methods and the resulting analysis times are shown to speedup by over 20X on a 32-core Cray XC40 node using workloads up to 8 million atoms, which is already beyond the at-scale capabilities of existing atom probe software. Using this k-NN algorithm, we also introduce a novel parameter estimation method for a class of cluster finding methods, called friends-of-friends (FoF) methods, to completely bypass their expensive pre-processing steps. In each case, we validate the results on a variety of control data sets.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"96 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129253882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lossless Parallel Implementation of a Turbo Decoder on GPU","authors":"K. Natarajan, N. Chandrachoodan","doi":"10.1109/HiPC.2018.00023","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00023","url":null,"abstract":"Turbo decoders use the recursive BCJR algorithm which is computationally intensive and hard to parallelise. The branch metric and extrinsic log-likelihood ratio computations are easily parallelisable, but the forward and backward metric computation is not parallelisable without compromising bit error rate. This paper proposes a lossless parallelisation technique for Turbo decoders on Graphics Processing Units (GPU). The recursive forward and backward metric computation is formulated as prefix (scan) matrix multiplication problem which is computed on the GPU using parallel prefix sum computation technique. Overall, this method achieves a throughput of 73 Mbps for a 3GPP LTE compliant turbo decoder without any BER loss and latency as low as 61 μs.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123054781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}