H. Kanezashi, T. Suzumura, D. García-Gasulla, Min-hwan Oh, S. Matsuoka
{"title":"Adaptive Pattern Matching with Reinforcement Learning for Dynamic Graphs","authors":"H. Kanezashi, T. Suzumura, D. García-Gasulla, Min-hwan Oh, S. Matsuoka","doi":"10.1109/HiPC.2018.00019","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00019","url":null,"abstract":"Graph pattern matching algorithms that handle million-scale dynamic graphs are widely used in applications such as social network analytics and suspicious transaction detection in financial networks. On the other hand, the computational complexity of many graph pattern matching algorithms is high, and it is not affordable to extract patterns from million-scale graphs. Moreover, most real-world networks are time-evolving, updating their structures continuously, which makes it harder to update and output newly matched patterns in real time. Many incremental graph pattern matching algorithms which reduce the number of updates have been proposed to handle such dynamic graphs. However, it is still challenging to recompute vertices in the incremental graph pattern matching algorithms in a single process, and that prevents real-time analysis. We propose an incremental graph pattern matching algorithm to deal with time-evolving graph data and also propose an adaptive optimization system based on reinforcement learning to recompute vertices in the incremental process more efficiently. Then we discuss the qualitative efficiency of our system with several types of data graphs and pattern graphs. We evaluate the performance using million-scale attributed and time-evolving social graphs. 
Our incremental algorithm is up to 10.1 times faster than an existing graph pattern matching algorithm and, with the adaptive system, up to 1.95 times faster on a computation node than naive incremental processing.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"166 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122189147","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Prasanna Balaprakash, Michael A. Salim, T. Uram, V. Vishwanath, Stefan M. Wild
{"title":"DeepHyper: Asynchronous Hyperparameter Search for Deep Neural Networks","authors":"Prasanna Balaprakash, Michael A. Salim, T. Uram, V. Vishwanath, Stefan M. Wild","doi":"10.1109/HiPC.2018.00014","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00014","url":null,"abstract":"Hyperparameters employed by deep learning (DL) methods play a substantial role in the performance and reliability of these methods in practice. Unfortunately, finding performance optimizing hyperparameter settings is a notoriously difficult task. Hyperparameter search methods typically have limited production-strength implementations or do not target scalability within a highly parallel machine, portability across different machines, experimental comparison between different methods, and tighter integration with workflow systems. In this paper, we present DeepHyper, a Python package that provides a common interface for the implementation and study of scalable hyperparameter search methods. It adopts the Balsam workflow system to hide the complexities of running large numbers of hyperparameter configurations in parallel on high-performance computing (HPC) systems. We implement and study asynchronous model-based search methods that consist of sampling a small number of input hyperparameter configurations and progressively fitting surrogate models over the input-output space until exhausting a user-defined budget of evaluations. 
We evaluate the efficacy of these methods relative to approaches such as random search, genetic algorithms, Bayesian optimization, and hyperband on DL benchmarks on CPU- and GPU-based HPC systems.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128092830","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compiling SIMT Programs on Multi- and Many-Core Processors with Wide Vector Units: A Case Study with CUDA","authors":"Hancheng Wu, J. Ravi, M. Becchi","doi":"10.1109/HiPC.2018.00022","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00022","url":null,"abstract":"Manycore processors and coprocessors with wide vector extensions, such as Intel Phi and Skylake devices, have become popular due to their high throughput capability. Performance optimization on these devices requires using both their x86-compatible cores and their vector units. While the x86-compatible cores can be programmed using traditional programming interfaces following the MIMD model, such as POSIX threads, MPI and OpenMP, the SIMD vector units are harder to program. The Intel software stack provides two approaches for code vectorization: automatic vectorization through the Intel compiler and manual vectorization through vector intrinsics. While the Intel compiler often fails to vectorize code with complex control flows and function calls, the manual approach is error-prone and leads to less portable code. Hence, there has been an increasing interest in SIMT programming tools allowing the simultaneous use of x86 cores and vector units while providing programmability and code portability. However, the effective implementation of the SIMT model on these hybrid architectures is not well understood. In this work, we target this problem. First, we propose a set of compiler techniques to transform programs written using a SIMT programming model (a subset of CUDA C) into code that leverages both the x86 cores and the vector units of a hybrid MIMD/SIMD architecture, thus providing programmability, high system utilization and performance. Second, we evaluate the proposed techniques on Xeon Phi and Skylake processors using micro-benchmarks and real-world applications. Third, we compare the resulting performance with that achieved by the same code on GPUs. 
Based on this analysis, we point out the main challenges in supporting the SIMT model on hybrid MIMD/SIMD architectures, while providing performance comparable to that of SIMT systems (e.g., GPUs).","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134254061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Expediting Parallel Graph Connectivity Algorithms","authors":"Kishore Kothapalli, Mihir Wadwekar","doi":"10.1109/HiPC.2018.00017","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00017","url":null,"abstract":"Determining whether a graph is k-connected and identifying its k-connected components are fundamental problems in graph theory. For this reason, there have been several algorithms for this problem in both the sequential and parallel settings. Several recent sequential and parallel algorithms for k-connectivity rely on one or more breadth-first traversals of the input graph. While BFS can be made very efficient in a sequential setting, the same cannot be said in the case of parallel environments. A major factor in this difficulty is the inherent requirement to use a shared queue, balance work among multiple threads in every round, synchronize, and the like. Optimizing the execution of BFS on many current parallel architectures is therefore quite challenging. For this reason, the time spent by current parallel graph connectivity algorithms on BFS operations is usually a significant portion of their overall runtime. In this paper, we study how one can, in the context of algorithms for graph connectivity, mitigate the practical inefficiency of relying on BFS operations in parallel. Our technique suggests that such algorithms may not require a BFS of the input graph but actually can work with a sparse spanning subgraph of the input graph. The incorrectness introduced by not using a BFS spanning tree can then be offset by further post-processing steps on suitably defined small auxiliary graphs. 
Our experiments on finding the 2- and 3-connectivity of graphs on Nvidia K40c GPUs improve on the state of the art for the corresponding problems by factors of 2.2x and 2.1x, respectively.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131876315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
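The abstract above notes that connectivity algorithms can work with a sparse spanning subgraph instead of a BFS tree. The sketch below is not the authors' GPU technique and does not handle 2- or 3-connectivity; it only illustrates, for plain connected components, that a spanning forest built without any BFS (here with a sequential union-find) keeps the component structure. The function name `spanning_forest` and the edge-list input format are assumptions made for this illustration.

```python
def spanning_forest(n, edges):
    """Return a spanning forest of an undirected graph as a list of edges.

    n     -- number of vertices, labelled 0 .. n-1 (assumed input format)
    edges -- iterable of (u, v) pairs

    No BFS is performed: each edge is kept only if it joins two trees
    that were previously separate (union-find with path halving).
    """
    parent = list(range(n))

    def find(v):
        # Walk to the root, shortening the path along the way.
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    forest = []
    for u, v in edges:
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv       # merge the two trees
            forest.append((u, v))
    return forest


# A triangle, one extra edge, and an isolated vertex 5.
edges = [(0, 1), (1, 2), (2, 0), (3, 4)]
forest = spanning_forest(6, edges)
print(forest)             # [(0, 1), (1, 2), (3, 4)]
print(6 - len(forest))    # number of connected components: 3
```

The forest has at most n - 1 edges regardless of how dense the input is, which is the sense in which it is a sparse stand-in for the full graph.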
{"title":"Shared-Memory Parallel Maximal Clique Enumeration","authors":"A. Das, Seyed-Vahid Sanei-Mehri, S. Tirthapura","doi":"10.1109/HiPC.2018.00016","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00016","url":null,"abstract":"We present shared-memory parallel methods for Maximal Clique Enumeration (MCE) from a graph. MCE is a fundamental and well-studied graph analytics task, and is a widely used primitive for identifying dense structures in a graph. Due to its computationally intensive nature, parallel methods are imperative for dealing with large graphs. However, surprisingly, there do not yet exist scalable and parallel methods for MCE on a shared-memory parallel machine. In this work, we present efficient shared-memory parallel algorithms for MCE, with the following properties: (1) the parallel algorithms are provably work-efficient relative to a state-of-the-art sequential algorithm, (2) the algorithms have a provably small parallel depth, showing that they can scale to a large number of processors, and (3) our implementations on a multicore machine show good speedup and scaling behavior with an increasing number of cores, and are substantially faster than prior shared-memory parallel algorithms for MCE.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-07-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127513371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
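For readers unfamiliar with the task named in the abstract above, here is a minimal sequential sketch of maximal clique enumeration using the classic Bron-Kerbosch recursion with pivoting. It is not the parallel algorithm the paper presents, and the abstract does not say which sequential algorithm the authors compare against; the function name `maximal_cliques` and the adjacency-dictionary input are assumptions made for this illustration.

```python
def maximal_cliques(adj):
    """Yield every maximal clique of an undirected graph as a sorted list.

    adj -- dict mapping each vertex to the set of its neighbours
           (no self-loops; assumed input format for this sketch)
    """
    def expand(r, p, x):
        # r: current clique, p: candidates that extend r,
        # x: vertices already tried (used to reject non-maximal cliques)
        if not p and not x:
            yield sorted(r)
            return
        # Pivot on the vertex covering the most candidates, so that
        # only non-neighbours of the pivot need to be branched on.
        pivot = max(p | x, key=lambda u: len(adj[u] & p))
        for v in list(p - adj[pivot]):
            yield from expand(r | {v}, p & adj[v], x & adj[v])
            p = p - {v}
            x = x | {v}

    yield from expand(set(), set(adj), set())


# Triangle 0-1-2 with a path 2-3-4 attached.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3}}
print(sorted(maximal_cliques(adj)))   # [[0, 1, 2], [2, 3], [3, 4]]
```

The recursive calls inside the loop are independent of one another once `p` and `x` are fixed, which is the kind of structure a parallel method can exploit; how the paper actually distributes that work is not stated in the abstract.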
{"title":"Parallel Nonnegative CP Decomposition of Dense Tensors","authors":"Grey Ballard, Koby Hayashi, R. Kannan","doi":"10.1109/HiPC.2018.00012","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00012","url":null,"abstract":"The CP tensor decomposition is a low-rank approximation of a tensor. We present a distributed-memory parallel algorithm and implementation of an alternating optimization method for computing a CP decomposition of dense tensors that can enforce nonnegativity of the computed low-rank factors. The principal task is to parallelize the Matricized-Tensor Times Khatri-Rao Product (MTTKRP) bottleneck subcomputation. The algorithm is computation efficient, using dimension trees to avoid redundant computation across MTTKRPs within the alternating method. Our approach is also communication efficient, using a data distribution and parallel algorithm across a multidimensional processor grid that can be tuned to minimize communication. We benchmark our software on synthetic as well as hyperspectral image and neuroscience dynamic functional connectivity data, demonstrating that our algorithm scales well to 100s of nodes (up to 4096 cores) and is faster and more general than the currently available parallel software.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128156560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
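The abstract above identifies the Matricized-Tensor Times Khatri-Rao Product (MTTKRP) as the bottleneck subcomputation. As a point of reference, the sketch below computes the mode-1 MTTKRP of a dense 3-way tensor directly from its elementwise definition, M[i][r] = sum over j, k of X[i][j][k] * B[j][r] * C[k][r]. It is a naive sequential version in plain Python, not the distributed, dimension-tree algorithm of the paper; the function name `mttkrp_mode1` and the nested-list layout are assumptions made for this illustration.

```python
def mttkrp_mode1(X, B, C):
    """Mode-1 MTTKRP of a dense 3-way tensor.

    X -- tensor as nested lists of shape I x J x K
    B -- factor matrix of shape J x R
    C -- factor matrix of shape K x R
    Returns M of shape I x R with
        M[i][r] = sum_j sum_k X[i][j][k] * B[j][r] * C[k][r]
    """
    I, J, K = len(X), len(B), len(C)
    R = len(B[0])
    M = [[0.0] * R for _ in range(I)]
    for i in range(I):
        for j in range(J):
            for k in range(K):
                x = X[i][j][k]
                for r in range(R):
                    M[i][r] += x * B[j][r] * C[k][r]
    return M


# Rank-1 test tensor X = a o b o c, so M[i][0] = a[i] * (b.b) * (c.c).
a, b, c = [1, 2], [1, 3], [2, 1]
X = [[[ai * bj * ck for ck in c] for bj in b] for ai in a]
B = [[bj] for bj in b]
C = [[ck] for ck in c]
print(mttkrp_mode1(X, B, C))   # [[50.0], [100.0]]
```

The four nested loops cost on the order of I * J * K * R operations per mode, which gives a sense of why this step dominates the running time of an alternating method on dense tensors.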
{"title":"Why do Users Kill HPC Jobs?","authors":"Venkatesh Prasad Ranganath, Daniel Andresen","doi":"10.1109/HiPC.2018.00039","DOIUrl":"https://doi.org/10.1109/HiPC.2018.00039","url":null,"abstract":"Given the cost of HPC clusters, making the best use of them is crucial to improve infrastructure ROI. Likewise, reducing failed HPC jobs and related waste in terms of user wait times is crucial to improve HPC user productivity (aka human ROI). While most efforts (e.g., debugging HPC programs) explore technical aspects to improve ROI of HPC clusters, we hypothesize non-technical (human) aspects are worth exploring to make non-trivial ROI gains; specifically, understanding non-technical aspects and how they contribute to the failure of HPC jobs. In this regard, we conducted a case study in the context of the Beocat cluster at Kansas State University. The purpose of the study was to learn the reasons why users terminate jobs and to quantify wasted computations in such jobs in terms of system utilization and user wait time. The data from the case study helped identify interesting and actionable reasons why users terminate HPC jobs. It also helped confirm that user-terminated jobs may be associated with a non-trivial amount of wasted computation, which if reduced can help improve the ROI of HPC clusters.","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-05-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122581055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Future of Supercomputing","authors":"M. Snir","doi":"10.1145/2597652.2616585","DOIUrl":"https://doi.org/10.1145/2597652.2616585","url":null,"abstract":"For over two decades, supercomputing evolved in a relatively straightforward manner: Supercomputers were assembled out of commodity microprocessors and leveraged their exponential increase in performance, due to Moore's Law. This simple model has been under stress since clock speed stopped growing a decade ago: Increased performance has required a commensurate increase in the number of concurrent threads. The evolution of device technology is likely to be even less favorable in the coming decade: The growth in CMOS performance is nearing its end, and no alternative technology is ready to replace CMOS. The continued shrinking of device size requires increasingly expensive technologies, and may not lead to improvements in cost/performance ratio; at which point, it ceases to make sense for commodity technology. These obstacles need not imply stagnation in supercomputer performance. In the long run, new computing models will come to the rescue. In the short run, more exotic, non-commodity device technologies can provide two or more orders of magnitude improvements in performance. Finally, better hardware and software architectures can significantly increase the efficiency of scientific computing platforms. 
While continued progress is possible, it will require a significant international research effort and major investments in future large-scale \"computational instruments\".","PeriodicalId":113335,"journal":{"name":"2018 IEEE 25th International Conference on High Performance Computing (HiPC)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128046371","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}