{"title":"Workflow Simulation Aware and Multi-threading Effective Task Scheduling for Heterogeneous Computing","authors":"Vasilios I. Kelefouras, K. Djemame","doi":"10.1109/HiPC.2018.00032","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00032","url":null,"abstract":"Efficient application scheduling is critical for achieving high performance in heterogeneous computing systems. This problem has proved to be NP-complete, heading research efforts in obtaining low complexity heuristics that produce good quality schedules. Although this problem has been extensively studied in the past, all the related works assume the computation costs of application tasks on processors are available a priori, ignoring the fact that the time needed to run/simulate all these tasks is orders of magnitude higher than finding a good quality schedule, especially in heterogeneous systems. In this paper, we propose two new methods applicable to several task scheduling algorithms for heterogeneous computing systems. We showcase both methods by using HEFT well known and popular algorithm, but they are applicable to other algorithms too, such as HCPT, HPS, PETS and CPOP. First, we propose a methodology to reduce the scheduling time of HEFT when the computation costs are unknown, without sacrificing the length of the output schedule (monotonic computation costs); this is achieved by reducing the number of computation costs required by HEFT and as a consequence the number of simulations applied. Second, we give heuristics to find which tasks are going to be executed as Single-Thread and which as Multi-Thread CPU implementations, as well as the number of the threads used. The experimental results considering both random graphs and real world applications show that extending HEFT with the two proposed methods achieves better schedule lengths, while at the same time requires from 4.5 up to 24 less simulations.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126476020","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Decentralized Privacy-Preserving Timed Execution in Blockchain-Based Smart Contract Platforms","authors":"Chao Li, Balaji Palanisamy","doi":"10.1109/HiPC.2018.00037","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00037","url":null,"abstract":"Timed transaction execution is critical for various decentralized privacy-preserving applications powered by blockchain-based smart contract platforms. Such privacy-preserving smart contract applications need to be able to securely maintain users' sensitive inputs off the blockchain until a prescribed execution time and then automatically make the inputs available to enable on-chain execution of the target function at the execution time, even if the user goes offline. While straight-forward centralized approaches provide a basic solution to the problem, unfortunately they are limited to a single point of trust. This paper presents a new decentralized privacy-preserving transaction scheduling approach that allows users of Ethereum-based decentralized applications to schedule transactions without revealing sensitive inputs before an execution time window selected by the users. The proposed approach involves no centralized party and allows users to go offline at their discretion after scheduling a transaction. The sensitive inputs are privately maintained by a set of trustees randomly selected from the network enabling the inputs to be revealed only at the execution time. The proposed protocol employs secret key sharing and layered encryption techniques and economic deterrence models to securely protect the sensitive information against possible attacks including some trustees destroying the sensitive information or secretly releasing the sensitive information prior to the execution time. We demonstrate the attack-resilience of the proposed approach through rigorous analysis. Our implementation and experimental evaluation on the Ethereum official test network demonstrates that the proposed approach is effective and has a low gas cost and time overhead associated with it.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114674071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Sampled Dense Matrix Multiplication for High-Performance Machine Learning","authors":"Israt Nisa, Aravind Sukumaran-Rajam, Süreyya Emre Kurt, Changwan Hong, P. Sadayappan","doi":"10.1109/HiPC.2018.00013","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00013","url":null,"abstract":"Many machine learning methods involve iterative optimization and are amenable to a variety of alternate formulations. Many currently popular formulations for some machine learning methods based on core operations that essentially correspond to sparse matrix-vector products. A reformulation using sparse matrix-matrix products primitives can potentially enable significant performance enhancement. Sampled Dense-Dense Matrix Multiplication (SDDMM) is a primitive that has been shown to be usable as a core component in reformulations of many machine learning factor analysis algorithms such as Alternating Least Squares (ALS), Latent Dirichlet Allocation (LDA), Sparse Factor Analysis (SFA), and Gamma Poisson (GaP). It requires the computation of the product of two input dense matrices but only at locations of the result matrix corresponding to nonzero entries in a sparse third input matrix. In this paper, we address the development of cuSDDMM, a multi-node GPU-accelerated implementation for SDDMM. We analyze the data reuse characteristics of SDDMM and develop a model-driven strategy for choice of tiling permutation and tile-size choice. cuSDDMM improves significantly (up to 4.6x) over the best currently available GPU implementation of SDDMM (in the BIDMach Machine Learning library).","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125261044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Code and Data Transformations to Address Garbage Collector Performance in Big Data Processing","authors":"Damon Fenacci, H. Vandierendonck, Dimitrios S. Nikolopoulos","doi":"10.1109/HiPC.2018.00040","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00040","url":null,"abstract":"Java, with its dynamic runtime environment and garbage collected GC memory management, is a very popular choice for big data processing engines. Its runtime provides convenient mechanisms to implement workload distribution without having to worry about direct memory allocation and deallocation. However, efficient memory usage is a recurring issue. In particular, our evaluation shows that garbage collection has huge drawbacks when handling a large number of data objects and more than 60% of execution time can be consumed by garbage collection. We present a set of unconventional strategies to counter this issue that rely on data layout transformations to drastically reduce the number of objects, and on changing the code structure to reduce the lifetime of objects. We encapsulate the implementation in Apache Spark making it transparent for software developers. Our preliminary results show an average speedup of 1.54 and a highest of 8.23 over a range of applications, datasets and GC types. In practice, this can provide a substantial reduction in execution time or allow a sizeable reduction in the amount of compute power needed for the same task.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"241 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115587534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Balancing Stragglers Against Staleness in Distributed Deep Learning","authors":"Saurav Basu, Vaibhav Saxena, Rintu Panja, Ashish Verma","doi":"10.1109/HiPC.2018.00011","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00011","url":null,"abstract":"Synchronous SGD is frequently the algorithm of choice for training deep learning models on compute clusters within reasonable time frames. However, even if a large number of workers (CPUs or GPUs) are at disposal for training, hetero-geneity of compute nodes and unreliability of the interconnecting network frequently pose a bottleneck to the training speed. Since the workers have to wait for each other at every model update step, even a single straggler/slow worker can derail the whole training performance. In this paper, we propose a novel approach to mitigate the straggler problem in large compute clusters. We cluster the compute nodes into multiple groups where each group updates the model synchronously stored in its own parameter server. The parameter servers of the different groups update the model in a central parameter server in an asynchronous manner. Few stragglers in the same group (or even separate groups) have little effect on the computational performance. The staleness of the asynchronous updates can be controlled by limiting the number of groups. Our method, in essence, provides a mechanism to move seamlessly between a pure synchronous and a pure asynchronous setting, thereby balancing between the computational overhead of synchronous SGD and the accuracy degradation of a pure asynchronous SGD. We empirically show that with increasing delay from straggler nodes (more than 300% delay in a node), progressive grouping of available workers still finishes the training within 20% of the no-delay case, with the limit to the number of groups governed by the permissible degradation in accuracy (≤ 2.5% compared to the no-delay case).","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124514961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Adaptive Runtime Features for Distributed Graph Algorithms","authors":"J. Firoz, Marcin Zalewski, Joshua D. Suetterlein, A. Lumsdaine","doi":"10.1109/HiPC.2018.00018","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00018","url":null,"abstract":"Performance of distributed graph algorithms can benefit greatly by forming rapport between algorithmic abstraction and the underlying runtime system that is responsible for scheduling work and exchanging messages. However, due to their dynamic and irregular nature of computation, distributed graph algorithms written in different programming models impose varying degrees of workload pressure on the runtime. To cope with such vastly different workload characteristics, a runtime has to make several trade-offs. One such trade-off arises, for example, when the runtime scheduler has to choose among alternatives such as whether to execute algorithmic work, or progress the network by probing network buffers, or throttle sending messages (termed flow control). This trade-off decides between optimizing the throughput of a runtime scheduler by increasing the rate of execution of algorithmic work, and reducing the latency of the network messages. Another trade-off exists when a decision has to be made about when to send aggregated messages in buffers (message coalescing). This decision chooses between trading off latency for network bandwidth and vice versa. At any instant, such trade-offs emphasize either on improving the quantity of work being executed (by maximizing the scheduler throughput) or on improving the quality of work (by prioritizing better work). However, encoding static policies for different runtime features (such as flow control, coalescing) can prevent graph algorithms from achieving their full potentials, thus can under-mine the actual performance of a distributed graph algorithm . In this paper, we investigate runtime support for distributed graph algorithms in the context of two paradigms: variants of well-known Bulk-Synchronous Parallel model and asynchronous programming model. We explore generic runtime features such as message coalescing (aggregation) and flow control and show that execution policies of these features need to be adjusted over time to make a positive impact on the execution time of a distributed graph algorithm. Since synchronous and asynchronous graph algorithms have different workload characteristics, not all of such runtime features may be good candidates for adaptation. Each of these algorithmic paradigms may require different set of features to be adapted over time. We demonstrate which set of feature(s) can be useful in each case to achieve the right balance of work in the runtime layer. Existing implementation of different graph algorithms can benefit from adapting dynamic policies in the underlying runtime.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133828206","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Novel Approach for Handling Soft Error in Conjugate Gradients","authors":"M. E. Ozturk, Marissa Renardy, Yukun Li, G. Agrawal, Ching-Shan Chou","doi":"10.1109/HiPC.2018.00030","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00030","url":null,"abstract":"Soft errors or bit flips have recently become an important challenge in high performance computing. In this paper, we focus on soft errors in a particular algorithm: conjugate gradients (CG). We present a series of techniques to detect soft errors in CG. We first derive a mathematical quantity that is monotonically decreasing. Next, we add a set of heuristics and combine our approach with previously established methods. We have extensively evaluated our method considering three distinct dimensions. First, we show that the F-score of our detection is significantly better than two other methods. Second, we show that for soft errors that are not detected by our method, the resulting inaccuracy in the final results are small, and better than those with other methods. Finally, we show that the runtime overheads of our method are lower than for other methods.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"2016 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127484153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Performance Prediction Framework for Irregular Applications","authors":"Gangyi Zhu, G. Agrawal","doi":"10.1109/HiPC.2018.00042","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00042","url":null,"abstract":"Predicting performance of applications is an important requirement for many goals – choosing future procurements or upgrades, selecting specific optimization/implementation, requesting and allocating resources, and others. Irregular access patterns, commonly seen in many compute-intensive and data-intensive applications, pose many challenges in estimating overall execution time of applications, including, but not limit to, cache behavior. While much work exists on analysis of cache behavior with regular accesses, relatively little attention has been paid to irregular codes. In this paper, we aim to predict execution time of irregular applications on different hardware configurations, with emphasis on analyzing cache behavior with varying size of the cache and the number of nodes. Cache performance of irregular computations is highly input-dependent. Based on the sparse matrix view of irregular computation as well as the cache locality analysis, we propose a novel sampling approach named Adaptive Stratified Row sampling – this method is capable of generating a representative sample that delivers cache performance similar to the original input. On top of our sampling method, we incorporate reuse distance analysis to accommodate different cache configurations with high efficiency. Besides, we modify SKOPE, a code skeleton framework, to predict the execution time for irregular applications with the predicted cache performance. The results show that our approaches keep average error rates under 6% in predicting L1 cache miss rate for different cache configurations. The average error rates of predicting execution time for sequential and parallel scenarios are under 5% and 15%, respectively.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127672638","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scalable Proximity-Based Methods for Large-Scale Analysis of Atom Probe Data","authors":"Hao Lu, S. Seal, J. Poplawsky","doi":"10.1109/HiPC.2018.00034","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00034","url":null,"abstract":"Powered by recent advances in data acquisition technologies, today's state-of-the-art atom probe microscopes yield data sets with sizes ranging from a few million atoms to billions of atoms. Analysis of these atomic data sets within rea-sonable turnaround times is a pressing data analysis challenge for material scientists currently equipped with software systems that do not scale to these massive data sets. Here, we present the shared memory component of a larger ongoing effort to develop a multi-feature data analysis framework capable of analyzing atom probe data of all sizes and scales from desktop multicore machines to large-scale high-performance computing platforms with hybrid (shared and distributed memory) architectures. Our focus here is on a broad class of popular atom probe data analysis methods that rely on core time-consuming k-NN queries. We present a scalable, heuristic algorithm for k-NN queries using three-dimensional range trees. To demonstrate its efficacy, the k-NN algorithm is integrated with two use cases of atom probe data analysis methods and the resulting analysis times are shown to speedup by over 20X on a 32-core Cray XC40 node using workloads up to 8 million atoms, which is already beyond the at-scale capabilities of existing atom probe software. Using this k-NN algorithm, we also introduce a novel parameter estimation method for a class of cluster finding methods, called friends-of-friends (FoF) methods, to completely bypass their expensive pre-processing steps. In each case, we validate the results on a variety of control data sets.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"96 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129253882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lossless Parallel Implementation of a Turbo Decoder on GPU","authors":"K. Natarajan, N. Chandrachoodan","doi":"10.1109/HiPC.2018.00023","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00023","url":null,"abstract":"Turbo decoders use the recursive BCJR algorithm which is computationally intensive and hard to parallelise. The branch metric and extrinsic log-likelihood ratio computations are easily parallelisable, but the forward and backward metric computation is not parallelisable without compromising bit error rate. This paper proposes a lossless parallelisation technique for Turbo decoders on Graphics Processing Units (GPU). The recursive forward and backward metric computation is formulated as prefix (scan) matrix multiplication problem which is computed on the GPU using parallel prefix sum computation technique. Overall, this method achieves a throughput of 73 Mbps for a 3GPP LTE compliant turbo decoder without any BER loss and latency as low as 61 μs.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123054781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}