2016 45th International Conference on Parallel Processing (ICPP): Latest Publications

Parallel k-Means++ for Multiple Shared-Memory Architectures
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-09-22 DOI: 10.1109/ICPP.2016.18
Patrick Mackey, R. Lewis
{"title":"Parallel k-Means++ for Multiple Shared-Memory Architectures","authors":"Patrick Mackey, R. Lewis","doi":"10.1109/ICPP.2016.18","DOIUrl":"https://doi.org/10.1109/ICPP.2016.18","url":null,"abstract":"In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varying data sizes.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122632619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
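For readers unfamiliar with the step being parallelized, the sketch below shows exact k-means++ seeding (D^2 sampling) in sequential Python. It is not the paper's multicore/GPU/Cray XMT implementation, only the algorithm whose per-center distance-update pass those implementations parallelize; function and variable names are our own.

```python
import numpy as np

def kmeans_pp_seeding(points, k, seed=None):
    """Exact k-means++ seeding (D^2 sampling).

    points: (n, d) array of data points.
    k:      number of initial centers to choose.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    centers = [points[rng.integers(n)]]                 # first center: uniform
    # d2[i] = squared distance from point i to its nearest chosen center
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        probs = d2 / d2.sum()                           # D^2 weighting
        idx = rng.choice(n, p=probs)                    # sample next center
        centers.append(points[idx])
        # Distance update over all points: the O(n*d) pass that shared-memory
        # implementations can parallelize across threads.
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.asarray(centers)
```

Each of the k-1 selection rounds performs one full distance update over the data, which dominates the cost and is the natural target for shared-memory parallelism.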
Optimizing GPU Register Usage: Extensions to OpenACC and Compiler Optimizations
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.72
Xiaonan Tian, Dounia Khaldi, Deepak Eachempati, Rengan Xu, B. Chapman
{"title":"Optimizing GPU Register Usage: Extensions to OpenACC and Compiler Optimizations","authors":"Xiaonan Tian, Dounia Khaldi, Deepak Eachempati, Rengan Xu, B. Chapman","doi":"10.1109/ICPP.2016.72","DOIUrl":"https://doi.org/10.1109/ICPP.2016.72","url":null,"abstract":"Using compiler directives to program accelerator-based systems through APIs such as OpenACC or OpenMP has increasingly gained popularity due to the portability and productivity advantages it offers. However, when comparing the performance typically achieved to what lower-level programming interfaces such as CUDA or OpenCL provides, directive-based approaches may entail a significant performance penalty. To support massively parallel computations, accelerators such as GPGPUs offer an expansive set of registers, larger than even the L1 cache, to hold the temporary state of each thread. Scalar variables are the mostly likely candidates to be assigned to these registers by the compiler. Hence, scalar replacement is a key enabling optimization for effectively improving the utilization of register files on accelerator devices and thereby substantially reducing the cost of memory operations. However, the aggressive application of scalar replacement may require a large number of registers, limiting the application of this technique unless mitigating approaches such as those described in this paper are taken. In this paper, we propose solutions to optimize the register usage within offloaded computations using OpenACC directives. We first present a compiler optimization called SAFARA that extends the classical scalar replacement algorithm to improve register file utilization on GPUs. Moreover, we extend the OpenACC interface by providing new clauses, namely dim and small, that will reduce the number of scalars to replace. SAFARA prioritizes the most beneficial data for allocation in registers based on frequency of use and also memory access latency. It also uses a static feedback strategy to retrieve low-level register information in order to guide the compiler in carrying out the scalar replacement transformation. Then, the new clauses we propose will extremely reduce the number of scalars, eliminating the need for more registers. We evaluate SAFARA and the new clauses using SPEC and NAS OpenACC benchmarks, our results suggest that these approaches will be effective for improving overall performance of code executing on GPUs. We got up to 2.5 speedup running NAS and 2.08 speedup while running SPEC benchmarks.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127215328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
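Scalar replacement happens inside the compiler on GPU kernels; the toy before/after below (plain Python, purely illustrative) only shows the source-level idea the abstract refers to: array references that are invariant in the inner loop are replaced by scalars, which a backend can then keep in registers instead of re-issuing memory operations. It does not reproduce SAFARA or the proposed dim/small clauses.

```python
# Before: a[i] and b[i] are re-read from the arrays on every inner iteration.
def update_rows_before(a, b, c, alpha):
    for i in range(len(a)):
        for j in range(len(c[i])):
            c[i][j] = alpha * a[i] + b[i] + c[i][j]

# After scalar replacement: the loop-invariant reads are hoisted into the
# scalars ai and bi, which a GPU compiler can keep in registers for the
# whole inner loop instead of re-issuing memory loads.
def update_rows_after(a, b, c, alpha):
    for i in range(len(a)):
        ai = a[i]
        bi = b[i]
        for j in range(len(c[i])):
            c[i][j] = alpha * ai + bi + c[i][j]
```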
Run-Time Performance Estimation and Fairness-Oriented Scheduling Policy for Concurrent GPGPU Applications
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.14
Qingda Hu, J. Shu, Jie Fan, Youyou Lu
{"title":"Run-Time Performance Estimation and Fairness-Oriented Scheduling Policy for Concurrent GPGPU Applications","authors":"Qingda Hu, J. Shu, Jie Fan, Youyou Lu","doi":"10.1109/ICPP.2016.14","DOIUrl":"https://doi.org/10.1109/ICPP.2016.14","url":null,"abstract":"In order to satisfy the competition of multiple GPU accelerated applications and make full use of GPU resources, a lot of previous works propose spatial-multitasking to execute multiple GPGPU applications simultaneously on a single GPU device. However, when adopting the spatial-multitasking framework, the inter-application interference may slow down different applications differently, leading to the unreasonable allocation of shared resources among concurrent GPGPU applications, degrading system fairness severely and resulting in sub-optimal performance. Thus, it is imperative to develop mechanisms to control negative inter-application interactions and utilize shared resources fairly and efficiently. Quantitatively estimating application slowdowns can enable us to accurately minimize system unfairness. Although several previous works pay attention on showdown estimation for CPUs, we find that they may be inaccurate for GPUs. Therefore, we propose a novel Dynamical Application Slowdown Estimation (DASE) model to estimate application slowdowns accurately. Our evaluations show that DASE has significantly lower estimation error (only 8.8%) than the state-of-the-art estimation models (36.3% and 32.8%) across all two-application workloads. Furthermore, to verify the effectiveness of our DASE model, we leverage our model to develop an efficient fairness-oriented Streaming Multiprocessors (SM) allocation policy DASE-Fair to minimize the overall system unfairness. Compared to the even SM partition policy, DASE-Fair improves fairness dramatically by more than 16.1% on average.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127043041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
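The abstract does not give the DASE equations, so the sketch below is only a hedged illustration of how run-time slowdown estimates can drive a fairness-oriented SM split for two co-running applications. Slowdown is assumed to be shared-mode runtime divided by run-alone runtime, unfairness the ratio of the worst to the best slowdown, and the runtime models stand in for whatever an estimator like DASE would predict; all names and formulas here are assumptions, not the paper's model.

```python
def slowdown(t_shared, t_alone):
    # Assumed definition: how much slower an application runs when co-running.
    return t_shared / t_alone

def unfairness(slowdowns):
    # Assumed metric: ratio of the worst slowdown to the best one.
    return max(slowdowns) / min(slowdowns)

def fair_sm_split(total_sms, runtime_a, runtime_b):
    """Pick the SM partition (n_a, n_b) that minimizes unfairness.

    runtime_a, runtime_b: callables mapping an SM count to a predicted
    runtime; at run time these predictions are what a slowdown-estimation
    model would have to supply.
    """
    t_alone_a = runtime_a(total_sms)                    # run-alone baselines
    t_alone_b = runtime_b(total_sms)
    best = None
    for n_a in range(1, total_sms):
        n_b = total_sms - n_a
        u = unfairness([slowdown(runtime_a(n_a), t_alone_a),
                        slowdown(runtime_b(n_b), t_alone_b)])
        if best is None or u < best[0]:
            best = (u, n_a, n_b)
    return best  # (unfairness, SMs for application A, SMs for application B)
```

With perfectly scalable applications (runtime proportional to 1/SMs) this search simply returns the even split; the value of a run-time estimation model is precisely that real applications deviate from that idealization.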
The Case for Cross-Component Power Coordination on Power Bounded Systems
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.66
Rong Ge, Xizhou Feng, Yangyang He, Pengfei Zou
{"title":"The Case for Cross-Component Power Coordination on Power Bounded Systems","authors":"Rong Ge, Xizhou Feng, Yangyang He, Pengfei Zou","doi":"10.1109/ICPP.2016.66","DOIUrl":"https://doi.org/10.1109/ICPP.2016.66","url":null,"abstract":"Modern computer systems are increasingly bounded by the available or permissible power at multiple layers, ranging from a single chip to an entire data center. To cope with this reality, it is necessary to understand how power bounds impact the design and performance of emergent computer systems. In this paper, we study the problem of coordinated power allocation between processors and memory modules on power-bounded systems. We experimentally and analytically investigate the dynamics between cross-component power allocation and application performance, identify the patterns of power allocation scenarios, and develop optimal power allocation methods. In our study, we discover that (1) different applications share categorical patterns with regard to how power allocations among individual components impact application performance and actual power, (2) the per-node power budget must exceed a certain threshold in order to achieve desirable performance and efficiency, (3) there exist workload-specific optimal power allocations under a given power budget and such optimal power coordination can be pinpointed using the heuristics derived from the categorical patterns and a light-weight power-performance profiling. Results from this study demonstrate the importance and feasibility of cross-component coordination to the implementation of power-bound high performance computing technology.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129642378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
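As a concrete, if simplified, picture of cross-component coordination (not the paper's heuristics), the sketch below enumerates processor/memory power splits under a node budget and keeps the split that a profiled performance model predicts to be fastest. The 5 W step, the cap ranges, and perf_model are illustrative assumptions.

```python
def best_power_split(node_budget_w, cpu_caps_w, mem_caps_w, perf_model, step_w=5):
    """Search processor/memory power allocations under a node power budget.

    cpu_caps_w, mem_caps_w: (min, max) permissible power caps in watts.
    perf_model(cpu_w, mem_w): predicted throughput under those caps, e.g.
    fitted from a lightweight power-performance profiling run.
    """
    best = None
    for cpu_w in range(cpu_caps_w[0], cpu_caps_w[1] + 1, step_w):
        mem_w = node_budget_w - cpu_w                   # rest of the budget
        if not (mem_caps_w[0] <= mem_w <= mem_caps_w[1]):
            continue                                    # infeasible split
        perf = perf_model(cpu_w, mem_w)
        if best is None or perf > best[0]:
            best = (perf, cpu_w, mem_w)
    return best  # (predicted throughput, CPU watts, memory watts)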
Randomly Optimized Grid Graph for Low-Latency Interconnection Networks
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.46
K. Nakano, Daisuke Takafuji, S. Fujita, Hiroki Matsutani, I. Fujiwara, M. Koibuchi
{"title":"Randomly Optimized Grid Graph for Low-Latency Interconnection Networks","authors":"K. Nakano, Daisuke Takafuji, S. Fujita, Hiroki Matsutani, I. Fujiwara, M. Koibuchi","doi":"10.1109/ICPP.2016.46","DOIUrl":"https://doi.org/10.1109/ICPP.2016.46","url":null,"abstract":"In this work we present randomly optimized grid graphs that maximize the performance measure, such as diameter and average shortest path length (ASPL), with subject to limited edge length on a grid surface. We also provide theoretical lower bounds of the diameter and the ASPL, which prove optimality of our randomly optimized grid graphs. We further present a diagonal grid layout that significantly reduces the diameter compared to the conventional one under the edge-length limitation. We finally show their applications to three case studies of off-and on-chip interconnection networks. Our design efficiently improves their performance measures, such as end-to-end communication latency, network power consumption, cost, and execution time of parallel benchmarks.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130633075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
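A minimal rendition of the random-optimization idea, assuming NetworkX and a Manhattan edge-length limit on the grid surface: repeatedly rewire a random edge to a random length-feasible pair and keep the change only when it lowers the average shortest path length. The acceptance rule, length metric, and iteration budget are our simplifications; the paper's construction, lower bounds, and diagonal layout go well beyond this.

```python
import random
import networkx as nx

def randomly_optimize_grid(n, max_len, iters=2000, seed=0):
    """Rewire an n x n grid graph at random to reduce ASPL.

    Edges are only allowed between nodes whose Manhattan distance on the
    grid surface is at most max_len.
    """
    rnd = random.Random(seed)
    g = nx.grid_2d_graph(n, n)
    nodes = list(g.nodes())
    manhattan = lambda u, v: abs(u[0] - v[0]) + abs(u[1] - v[1])
    aspl = nx.average_shortest_path_length(g)
    for _ in range(iters):
        old = rnd.choice(list(g.edges()))
        u, v = rnd.choice(nodes), rnd.choice(nodes)
        if u == v or g.has_edge(u, v) or manhattan(u, v) > max_len:
            continue                                    # length-infeasible
        g.remove_edge(*old)
        g.add_edge(u, v)
        if nx.is_connected(g):
            new_aspl = nx.average_shortest_path_length(g)
            if new_aspl < aspl:
                aspl = new_aspl                         # keep improving rewire
                continue
        g.remove_edge(u, v)                             # otherwise revert
        g.add_edge(*old)
    return g, aspl, nx.diameter(g)
```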
Programming Techniques for the Automata Processor
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.30
Indranil Roy, Ankit Srivastava, S. Aluru
{"title":"Programming Techniques for the Automata Processor","authors":"Indranil Roy, Ankit Srivastava, S. Aluru","doi":"10.1109/ICPP.2016.30","DOIUrl":"https://doi.org/10.1109/ICPP.2016.30","url":null,"abstract":"The Micron Automata Processor (AP) is a novel co-processor accelerator that supports the parallel execution of multiple Nondeterministic Finite Automata (NFA) programmed directly into hardware over a single data-stream. In this paper, we present a number of programming techniques to develop automata that execute efficiently on this processor. First, we present general techniques to transform NFAs defined in their classical representation to the representation used by the AP, and optimize the same. Then, we present automata development techniques using simple but powerful generic building blocks. All the above techniques are generic in nature and can be useful to application developers working on this new upcoming co-processor architecture.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127624964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
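The AP executes state-labeled (homogeneous) automata: each state-transition element (STE) recognizes a symbol set, and every transition into it implicitly carries that label, so a classical transition-labeled NFA must first be homogenized. The sketch below shows that standard transformation over a simple tuple-based NFA encoding of our own choosing; it is not Micron's ANML/SDK API and omits the paper's optimization passes.

```python
def nfa_to_homogeneous(transitions, start, accepting):
    """Homogenize a transition-labeled NFA for a state-labeled target.

    transitions: iterable of (src, symbol, dst) triples.
    start:       the NFA start state.
    accepting:   set of accepting NFA states.

    Each new state ("STE") is a pair (dst, symbol): it matches `symbol`
    and represents having just entered dst on that symbol.
    """
    transitions = list(transitions)
    stes = {(dst, sym) for (_, sym, dst) in transitions}
    # Start-enabled STEs: transitions the NFA could take on the first symbol.
    start_stes = {(dst, sym) for (src, sym, dst) in transitions if src == start}
    # STE (q, a) activates STE (r, b) iff the NFA has a transition q -b-> r.
    activations = {(ste, (dst, sym))
                   for ste in stes
                   for (src, sym, dst) in transitions
                   if src == ste[0]}
    reporting = {ste for ste in stes if ste[0] in accepting}
    return stes, start_stes, activations, reporting

# Example: strings over {a, b} that end in "ab".
stes, starts, acts, reports = nfa_to_homogeneous(
    [("q0", "a", "q0"), ("q0", "b", "q0"), ("q0", "a", "q1"), ("q1", "b", "q2")],
    start="q0", accepting={"q2"})
```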
One-Sided Interface for Matrix Operations Using MPI-3 RMA: A Case Study with Elemental
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.28
Sayan Ghosh, J. Hammond, Antonio J. Peña, P. Balaji, A. Gebremedhin, B. Chapman
{"title":"One-Sided Interface for Matrix Operations Using MPI-3 RMA: A Case Study with Elemental","authors":"Sayan Ghosh, J. Hammond, Antonio J. Peña, P. Balaji, A. Gebremedhin, B. Chapman","doi":"10.1109/ICPP.2016.28","DOIUrl":"https://doi.org/10.1109/ICPP.2016.28","url":null,"abstract":"A one-sided programming model separates communication from synchronization, and is the driving principle behind partitioned global address space (PGAS) libraries such as Global Arrays (GA) and SHMEM. PGAS models expose a rich set of functionality that a developer needs in order to implement mathematical algorithms that require frequent multidimensional array accesses. However, use of existing PGAS libraries in application codes often requires significant development effort in order to fully exploit these programming models. On the other hand, a vast majority of scientific codes use MPI either directly or indirectly via third-party scientific computation libraries, and need features to support application-specific communication requirements (e.g., asynchronous update of distributed sparse matrices, commonly arising in machine learning workloads). For such codes it is often impractical to completely shift programming models in favor of special one-sided communication middleware. Instead, an elegant and productive solution is to exploit the one-sided functionality already offered by MPI-3 RMA (Remote Memory Access). We designed a general one-sided interface using the MPI-3 passive RMA model for remote matrix operations in the linear algebra library Elemental, we call the interface we designed RMAInterface. Elemental is an open source library for distributed-memory dense and sparse linear algebra and optimization. We employ RMAInterface to construct a Global Arrays-like API and demonstrate its performance scalability and competitivity with that of the existing GA (with ARMCI-MPI) for a quantum chemistry application.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126322389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
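RMAInterface itself is a C++ layer inside Elemental, but the underlying MPI-3 passive-target pattern it builds on can be sketched with mpi4py (assumed available): each rank exposes its local matrix block through a window, and a peer updates a remote block inside a lock/accumulate/unlock epoch without the owner's participation. The block size and the ring-style target choice are illustrative.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank owns one 4x4 block of a distributed matrix and exposes it
# through an RMA window (16 doubles).
local_block = np.zeros((4, 4), dtype="d")
win = MPI.Win.Create(local_block, comm=comm)

# Passive-target update: add a contribution into the next rank's block
# without that rank participating in the communication.
peer = (rank + 1) % size
contribution = np.full(16, float(rank), dtype="d")

win.Lock(peer, MPI.LOCK_SHARED)
win.Accumulate(contribution, peer, target=[0, 16, MPI.DOUBLE], op=MPI.SUM)
win.Unlock(peer)                      # completes the update at the target

comm.Barrier()                        # make sure every rank's update landed
if rank == 0:
    print(local_block)
win.Free()
```

Accumulate with MPI.SUM is elementwise atomic, which is why a shared lock suffices even when several ranks target the same block concurrently.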
A Parallel Hill-Climbing Refinement Algorithm for Graph Partitioning
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.34
Dominique LaSalle, G. Karypis
{"title":"A Parallel Hill-Climbing Refinement Algorithm for Graph Partitioning","authors":"Dominique LaSalle, G. Karypis","doi":"10.1109/ICPP.2016.34","DOIUrl":"https://doi.org/10.1109/ICPP.2016.34","url":null,"abstract":"Graph partitioning is important in distributing workloads on parallel compute systems, computing sparse matrix re-orderings, and designing VLSI circuits. Refinement algorithms are used to improve existing partitionings, and are essential for obtaining high-quality partitionings. Existing parallel refinement algorithms either extract concurrency by sacrificing in terms of quality, or preserve quality by restricting concurrency. In this work we present a new shared-memory parallel algorithm for refining an existing k-way partitioning that can break out of local minima and produce high-quality partitionings. This allows our algorithm to scale well in terms of the number of processing cores and produce clusterings of quality equal to serial algorithms. Our algorithm achieves speedups of 5.7 - 16.7× using 24 cores, while exhibiting only 0.52% higher edgecuts than when run serially. This is 6.3× faster and 1.9% better quality than other parallel refinement algorithms which can break out of local minima.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130439792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 40
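Not the paper's parallel hill-climbing algorithm, but a minimal serial sketch of the gain-based boundary refinement such algorithms build on: a vertex moves to the partition where it has more neighbors whenever the move lowers the edgecut and respects a balance bound. Because it only accepts strictly improving moves, this baseline gets stuck in exactly the local minima the paper's hill-climbing is designed to escape; names and the imbalance tolerance are our own.

```python
from collections import defaultdict

def greedy_refine(adj, part, k, max_imbalance=1.1, max_passes=10):
    """Greedy edgecut refinement of an existing k-way partitioning.

    adj:  dict mapping each vertex to its neighbors (undirected, unweighted).
    part: dict mapping each vertex to a partition id in [0, k); updated in place.
    """
    cap = len(adj) / k * max_imbalance                  # balance bound per part
    sizes = defaultdict(int)
    for v in adj:
        sizes[part[v]] += 1
    for _ in range(max_passes):
        moved = False
        for v in adj:
            home = part[v]
            conn = defaultdict(int)                     # neighbors per partition
            for u in adj[v]:
                conn[part[u]] += 1
            internal = conn.get(home, 0)                # neighbors kept at home
            best, best_gain = home, 0
            for p, c in conn.items():
                gain = c - internal                     # edgecut reduction
                if p != home and gain > best_gain and sizes[p] + 1 <= cap:
                    best, best_gain = p, gain
            if best != home:
                sizes[home] -= 1
                sizes[best] += 1
                part[v] = best
                moved = True
        if not moved:
            break                                       # local minimum reached
    return part
```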
An Unbounded Nonblocking Double-Ended Queue
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.32
Matthew Graichen, Joseph Izraelevitz, M. Scott
{"title":"An Unbounded Nonblocking Double-Ended Queue","authors":"Matthew Graichen, Joseph Izraelevitz, M. Scott","doi":"10.1109/ICPP.2016.32","DOIUrl":"https://doi.org/10.1109/ICPP.2016.32","url":null,"abstract":"We introduce a new algorithm for an unbounded concurrent double-ended queue (deque). Like the bounded deque of Herlihy, Luchangco, and Moir on which it is based, the new algorithm is simple and obstruction free, has no pathological long-latency scenarios, avoids interference between operations at opposite ends, and requires no special hardware support beyond the usual compare-and-swap. To the best of our knowledge, no prior concurrent deque combines these properties with unbounded capacity, or provides consistently better performance across a wide range of concurrent workloads.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133623689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Optimal Multi-taxi Dispatch for Mobile Taxi-Hailing Systems
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.41
Guoju Gao, Mingjun Xiao, Zhenhua Zhao
{"title":"Optimal Multi-taxi Dispatch for Mobile Taxi-Hailing Systems","authors":"Guoju Gao, Mingjun Xiao, Zhenhua Zhao","doi":"10.1109/ICPP.2016.41","DOIUrl":"https://doi.org/10.1109/ICPP.2016.41","url":null,"abstract":"Traditional taxi-hailing systems through wireless networks in metropolitan areas allow taxis to compete for passengers chaotically and accidentally, which generally result in inefficiencies, long waiting time and low satisfaction of taxi-hailing passengers. In this paper, we propose a new Mobile Taxi-hailing System (called MTS) based on optimal multi-taxi dispatch, which can be used by taxi service companies (TSCs). Different from the competition modes used in traditional taxi-hailing systems, MTS assigns vacant taxis to taxi-hailing passengers proactively. For the taxi dispatch problem in MTS, we define a system utility function, which involves the total net profits of taxis and waiting time of passengers. Moreover, in the utility function, we take into consideration the various classes of taxis with different resource configurations, and the cost associated with taxis' empty travel distances. Our goal is to maximize the system utility function, restricted by the individual net profits of taxis and the passengers' requirements for specified classes of taxis. To solve this problem, we design an optimal algorithm based on the idea of Kuhn-Munkres (called KMBA), and prove the correctness and optimality of the proposed algorithm. Additionally, we demonstrate the significant performances of our algorithm through extensive simulations.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133114546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
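The paper's objective (net profit, taxi classes, empty-travel cost) is richer than this, but the assignment core it builds on, the Kuhn-Munkres algorithm, can be sketched with SciPy's Hungarian-method solver over an illustrative taxi-to-passenger pickup-distance cost matrix; the Euclidean cost and the example coordinates are assumptions for demonstration only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dispatch(taxi_pos, passenger_pos):
    """Assign taxis to passengers with minimum total pickup distance.

    taxi_pos:      (m, 2) array of vacant taxi coordinates.
    passenger_pos: (n, 2) array of waiting passenger coordinates.
    Returns a list of (taxi_index, passenger_index) pairs.
    """
    # Cost matrix: Euclidean pickup distance from each taxi to each passenger.
    cost = np.linalg.norm(taxi_pos[:, None, :] - passenger_pos[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)            # Kuhn-Munkres
    return list(zip(rows.tolist(), cols.tolist()))

# Example: three vacant taxis, two waiting passengers.
pairs = dispatch(np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 4.0]]),
                 np.array([[0.5, 0.5], [4.0, 4.0]]))
```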