Cache-conscious scheduling of streaming pipelines on parallel machines with private caches
Kunal Agrawal, Jordyn C. Maglalang, Jeremy T. Fineman
2014 21st International Conference on High Performance Computing (HiPC), Dec. 2014. DOI: 10.1109/HiPC.2014.7116893

Abstract: This paper studies the problem of scheduling a streaming pipeline on a multicore machine with private caches to maximize throughput. The theoretical contribution includes lower and upper bounds in the parallel external-memory model. We show that a simple greedy scheduling strategy is asymptotically optimal with a constant-factor memory augmentation. More specifically, we show that if our strategy has a running time of Q cache misses on a machine with size-M caches, then every "static" scheduling policy must incur at least Ω(Q) cache misses on a machine with size-M/6 caches. Our experimental study considers the question of whether scheduling based on cache effects is more important than scheduling based only on the number of computation steps. Using synthetic pipelines with a range of parameters, we compare our cache-based partitioning against several other static schedulers that load-balance computation. In most cases, the cache-based partitioning indeed beats the other schedulers, but there are some cases that go the other way. We conclude that considering cache effects is a good idea, but other features of the streaming pipeline are also important.
CQA: A code quality analyzer tool at binary level
Andres Charif Rubial, Emmanuel Oseret, Jose Noudohouenou, W. Jalby, G. Lartigue
2014 21st International Conference on High Performance Computing (HiPC), Dec. 2014. DOI: 10.1109/HiPC.2014.7116904

Abstract: Most of today's performance analysis tools focus on issues occurring at the multi-core and communication level. However, there are several reasons why an application may not behave correctly in terms of performance at the core level. For a significant part, loops in industrial applications are limited by the quality of the code generated by the compiler and do not always fully benefit from the available computing power of recent processors. For instance, when the compiler is unable to vectorize loops, up to an 8x factor can be lost. It is essential to validate core-level performance before focusing on higher-level issues. This paper presents the CQA tool, a loop-centric code quality analyzer based on simplified unicore-architecture performance modeling and on quality metrics. The tool analyzes the quality of the code generated by the compiler. It provides high-level metrics along with human-understandable reports that relate to the source code. Our performance model assumes that all data are resident in the first-level cache. It identifies architectural bottlenecks and estimates the number of cycles spent in each iteration of a given innermost loop. Our modeling and analyses are performed statically and require no execution or recompilation of the application. We show practical examples of situations where our tool provides very valuable information leading to a performance gain.
{"title":"Performance evaluation of multi core systems for high throughput medical applications involving model predictive control","authors":"Madhurima Pore, Ayan Banerjee, S. Gupta","doi":"10.1109/HiPC.2014.7116884","DOIUrl":"https://doi.org/10.1109/HiPC.2014.7116884","url":null,"abstract":"Many medical control devices used in case of critical patients have model predictive controllers (MPC). MPC estimate the drug level in the parts of patients body based on their human physiology model to either alarm the medical authority or change the drug infusion rate. This model prediction has to be completed before the drug infusion rate is changed i.e. every few seconds. Instead of mathematical models like the Pharmacokinetic models more accurate models such as spatio-temporal drug diffusion can be used for improving the prediction and prevention of drug overshoot and undershoot. However, these models require high computation capability of platforms like recent many core GPUs or Intel Xeon Phi (MIC) or IntelCore i7. This work explores thread level and data level parallelism and computation versus communication times of such different model predictive applications used in multiple patient monitoring in hospital data centers exploiting the many core platforms for maximizing the throughput (i.e. patients monitored simultaneously). We also study the energy and performance of these applications to evaluate them for architecture suitability. We show that given a set of MPC applications, mapping on heterogeneous platforms can give performance improvement and energy savings.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129427499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Balancing context switch penalty and response time with elastic time slicing
Nagakishore Jammula, Moinuddin K. Qureshi, Ada Gavrilovska, Jongman Kim
2014 21st International Conference on High Performance Computing (HiPC), Dec. 2014. DOI: 10.1109/HiPC.2014.7116707

Abstract: Virtualization allows a platform to present an increased number of logical processors by multiplexing the underlying resources across different virtual machines. The hardware resources are time-shared not only between different virtual machines, but also between different workloads of the same virtual machine. An important source of performance degradation in such a scenario is the cache warmup penalty a workload experiences when it gets scheduled, as the working set belonging to the workload is displaced by other concurrently running workloads. We show that a virtual machine that time-switches between four workloads can slow some of the workloads down by as much as 54%. However, such performance degradation depends on workload behavior, with some workloads experiencing negligible degradation and some severe degradation. We propose Elastic Time Slicing (ETS) to reduce the context switch overhead for the most affected workloads. We demonstrate that by taking the workload-specific context switch overhead into consideration, the CPU scheduler can make better decisions to minimize the context switch penalty for the most affected workloads, resulting in substantial performance improvements. ETS enhances performance without compromising response time, thereby achieving dual benefits. To facilitate ETS, we develop a low-overhead hardware-based mechanism that dynamically estimates the sensitivity of a given workload to context switching. We evaluate the accuracy of the mechanism under various cache management policies and show that it is very reliable. Context-switch-related warmup penalties increase as optimizations are applied to address traditional cache misses. For the first time, we assess the impact of advanced replacement policies and establish that it is significant.
High performance MPI library over SR-IOV enabled InfiniBand clusters
Jie Zhang, Xiaoyi Lu, Jithin Jose, Mingzhe Li, Rong Shi, D. Panda
2014 21st International Conference on High Performance Computing (HiPC), Dec. 2014. DOI: 10.1109/HiPC.2014.7116876

Abstract: Virtualization plays a central role in the HPC cloud thanks to easy management and the low cost of computation and communication. Recently, Single Root I/O Virtualization (SR-IOV) technology has been introduced for high-performance interconnects such as InfiniBand and can attain near-native performance for inter-node communication. However, the SR-IOV scheme lacks locality-aware communication support, which leads to performance overheads for inter-VM communication within the same physical node. To address this issue, this paper first proposes a high-performance design of an MPI library over SR-IOV-enabled InfiniBand clusters that dynamically detects VM locality and coordinates data movement between the SR-IOV and inter-VM shared memory (IVShmem) channels. Through our proposed design, MPI applications running in virtualized mode can achieve efficient locality-aware communication on SR-IOV-enabled InfiniBand clusters. In addition, we optimize communication in the IVShmem and SR-IOV channels by analyzing the performance impact of core mechanisms and parameters inside the MPI library to deliver better performance in virtual machines. Finally, we conduct comprehensive performance studies using point-to-point and collective benchmarks as well as HPC applications. Experimental evaluations show that our proposed MPI library design can significantly improve the performance of point-to-point operations, collective operations, and MPI applications with different InfiniBand transport protocols (RC and UD) by up to 158%, 76%, and 43%, respectively, compared with SR-IOV. To the best of our knowledge, this is the first study to offer a high-performance MPI library that supports efficient locality-aware MPI communication over SR-IOV-enabled InfiniBand clusters.
Designing efficient small message transfer mechanism for inter-node MPI communication on InfiniBand GPU clusters
Rong Shi, S. Potluri, Khaled Hamidouche, Jonathan L. Perkins, Mingzhe Li, D. Rossetti, D. Panda
2014 21st International Conference on High Performance Computing (HiPC), Dec. 2014. DOI: 10.1109/HiPC.2014.7116873

Abstract: An increasing number of MPI applications are being ported to take advantage of the compute power offered by GPUs. Data movement on GPU clusters continues to be the major bottleneck that keeps scientific applications from fully harnessing the potential of GPUs. Previously, inter-node GPU-GPU communication had to move data from GPU memory to host memory before sending it over the network. MPI libraries like MVAPICH2 have provided solutions to alleviate this bottleneck using host-based pipelining techniques. Beyond that, the newly introduced GPUDirect RDMA (GDR) is a promising solution for further reducing this data movement bottleneck. However, the existing design in MPI libraries applies the rendezvous protocol to all message sizes, which incurs considerable overhead for small-message communication due to extra synchronization message exchanges. In this paper, we propose new techniques to optimize inter-node GPU-to-GPU communication for small message sizes. Our designs supporting the eager protocol include efficient support at both the sender and receiver sides. Furthermore, we propose a new data path to provide fast copies between host and GPU memories. To the best of our knowledge, this is the first study to propose efficient designs for GPU communication for small message sizes using the eager protocol. Our experimental results demonstrate up to 59% and 63% reductions in latency for GPU-to-GPU and CPU-to-GPU point-to-point communication, respectively. These designs boost the uni-directional bandwidth by 7.3x and 1.7x, respectively. We also evaluate our proposed design with two end applications: GPULBM and HOOMD-blue. Performance numbers on Kepler GPUs show that, compared to the best existing GDR design, our proposed designs achieve up to a 23.4% latency reduction for GPULBM and a 58% increase in average TPS for HOOMD-blue.
{"title":"Relax-Miracle: GPU parallelization of semi-analytic fourier-domain solvers for earthquake modeling","authors":"S. Masuti, S. Barbot, Nachiket Kapre","doi":"10.1109/HIPC.2014.7116901","DOIUrl":"https://doi.org/10.1109/HIPC.2014.7116901","url":null,"abstract":"Effective utilization of GPU processing capacity for scientific workloads is often limited by memory throughput and PCIe communication transfer times. This is particularly true for semi-analytic Fourier-domain computations in earthquake modeling (Relax) where operations on large-scale 3D data structures can require moving large volumes of data from storage to the compute in predictable but orthogonal access patterns. We show how to transform the computation to avoid PCIe transfers entirely by reconstructing the 3D data structures directly within the GPU global memory. We also consider arithmetic transformations that replace some communication-intensive 1D FFTs with simpler, data-parallel analytical solutions. Using our approach we are able to reduce computation times for a geophysical model of the 2012 Mw8.7 Wharton Basin earthquake from 2 hours down to 15 minutes (speedup of ≈8x) for grid sizes of 512-512-256 when comparing NVIDIA K20 with a 16-threaded Intel Xeon E5-2670 CPU (supported by Intel-MKL libraries). Our GPU-accelerated solution (called Relax-Miracle) also makes it possible to conduct Markov-Chain Monte-Carlo simulations using more than 1000 time-dependent models on 12 GPUs per single day of calculation, enhancing our ability to use such techniques for time-consuming data inversion and Bayesian inversion experiments.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125999671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Online failure prediction for HPC resources using decentralized clustering
Alejandro Pelaez, Andres Quiroz, J. Browne, Edward Chuah, M. Parashar
2014 21st International Conference on High Performance Computing (HiPC), Dec. 2014. DOI: 10.1109/HiPC.2014.7116903

Abstract: Ensuring high reliability of large-scale clusters is becoming more critical as the size of these machines continues to grow, since growth increases the complexity and number of interactions between different nodes and thus results in a high failure frequency. For this reason, predicting node failures in order to prevent errors from happening in the first place has become extremely valuable. A common approach to failure prediction is to analyze traces of system events to find correlations between event types or anomalous event patterns and node failures, and to use the types or patterns identified as failure predictors at run-time. However, typical centralized solutions for failure prediction of this kind suffer from high transmission and processing overheads at very large scales. We present a solution to the problem of predicting compute-node soft lockups in large-scale clusters that uses a decentralized online clustering algorithm (DOC) to detect anomalies in resource usage logs, which have been shown to correlate with particular types of node failures in supercomputer clusters. We demonstrate the effectiveness of this system using the monitoring logs from the Ranger supercomputer at the Texas Advanced Computing Center. Experiments show that this approach can achieve accuracy similar to other related approaches while maintaining low RAM and bandwidth usage, with a runtime impact on currently running applications of less than 2%.
{"title":"Optical overlay NUCA: A high speed substrate for shared L2 caches","authors":"E. Peter, A. Arora, Akriti Bagaria, S. Sarangi","doi":"10.1145/3064833","DOIUrl":"https://doi.org/10.1145/3064833","url":null,"abstract":"In this paper, we propose to use optical NOCs to design cache access protocols for large shared L2 caches. We observe that the problem is unique because optical networks have very low latency, and in principle all the cache banks are very close to each other. A naive approach is to broadcast a request to a set of banks that might possibly contain the copy of a block. However, this approach is wasteful in terms of energy and bandwidth. Hence, we propose a novel scheme in this paper, TSI, which proposes to create a set of virtual networks (overlays) of cache banks over a physical optical NOC. We search for a block inside each overlay using a combination of multicast and unicast messages. We additionally create support for our overlay networks by proposing optimizations to the previously proposed R-SWMR network. We also propose a set of novel hardware structures for creating and managing overlays, and for efficiently locating blocks in the overlay. The performance of the TSI scheme is within 2-3% of a broadcast scheme, and it is faster than traditional static NUCA schemes by 50%. As compared to the broadcast scheme it reduces the number of accesses, and consequently the dynamic energy by 20-30%.","PeriodicalId":337777,"journal":{"name":"2014 21st International Conference on High Performance Computing (HiPC)","volume":"120 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128259144","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Particle advection performance over varied architectures and workloads
H. Childs, Scott Biersdorff, David Poliakoff, David Camp, A. Malony
2014 21st International Conference on High Performance Computing (HiPC), Dec. 2014. DOI: 10.1109/HiPC.2014.7116900

Abstract: Particle advection is a foundational operation for many flow visualization techniques, including streamlines, Finite-Time Lyapunov Exponent (FTLE) calculation, and stream surfaces. The workload for particle advection problems varies greatly, including significant variation in computational requirements. With this study, we consider the performance impacts of hardware architecture on this problem, studying distributed-memory systems with CPUs with varying numbers of cores per node, and with nodes with one to three GPUs. Our goal was to explore which architectures were best suited to which workloads, and why. While the results of this study will help inform visualization scientists which architectures they should use when solving certain flow visualization problems, it is also informative for the larger HPC community, since many simulation codes will soon incorporate visualization via in situ techniques.