2016 45th International Conference on Parallel Processing (ICPP): Latest Publications

Parallel k-Means++ for Multiple Shared-Memory Architectures
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-09-22 DOI: 10.1109/ICPP.2016.18
Patrick Mackey, R. Lewis
{"title":"Parallel k-Means++ for Multiple Shared-Memory Architectures","authors":"Patrick Mackey, R. Lewis","doi":"10.1109/ICPP.2016.18","DOIUrl":"https://doi.org/10.1109/ICPP.2016.18","url":null,"abstract":"In recent years k-means++ has become a popular initialization technique for improved k-means clustering. To date, most of the work done to improve its performance has involved parallelizing algorithms that are only approximations of k-means++. In this paper we present a parallelization of the exact k-means++ algorithm, with a proof of its correctness. We develop implementations for three distinct shared-memory architectures: multicore CPU, high performance GPU, and the massively multithreaded Cray XMT platform. We demonstrate the scalability of the algorithm on each platform. In addition we present a visual approach for showing which platform performed k-means++ the fastest for varying data sizes.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-09-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122632619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
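For readers unfamiliar with the step being parallelized, the sketch below shows exact k-means++ seeding (D^2 sampling) in sequential Python. It is not the paper's multicore/GPU/Cray XMT implementation, only the algorithm whose per-center distance-update pass those implementations parallelize; function and variable names are our own.

```python
import numpy as np

def kmeans_pp_seeding(points, k, seed=None):
    """Exact k-means++ seeding (D^2 sampling).

    points: (n, d) array of data points.
    k:      number of initial centers to choose.
    """
    rng = np.random.default_rng(seed)
    n = points.shape[0]
    centers = [points[rng.integers(n)]]                 # first center: uniform
    # d2[i] = squared distance from point i to its nearest chosen center
    d2 = np.sum((points - centers[0]) ** 2, axis=1)
    for _ in range(1, k):
        probs = d2 / d2.sum()                           # D^2 weighting
        idx = rng.choice(n, p=probs)                    # sample next center
        centers.append(points[idx])
        # Distance update over all points: the O(n*d) pass that shared-memory
        # implementations can parallelize across threads.
        d2 = np.minimum(d2, np.sum((points - points[idx]) ** 2, axis=1))
    return np.asarray(centers)
```

Each of the k-1 selection rounds performs one full distance update over the data, which dominates the cost and is the natural target for shared-memory parallelism.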
Optimizing GPU Register Usage: Extensions to OpenACC and Compiler Optimizations
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.72
Xiaonan Tian, Dounia Khaldi, Deepak Eachempati, Rengan Xu, B. Chapman
{"title":"Optimizing GPU Register Usage: Extensions to OpenACC and Compiler Optimizations","authors":"Xiaonan Tian, Dounia Khaldi, Deepak Eachempati, Rengan Xu, B. Chapman","doi":"10.1109/ICPP.2016.72","DOIUrl":"https://doi.org/10.1109/ICPP.2016.72","url":null,"abstract":"Using compiler directives to program accelerator-based systems through APIs such as OpenACC or OpenMP has increasingly gained popularity due to the portability and productivity advantages it offers. However, when comparing the performance typically achieved to what lower-level programming interfaces such as CUDA or OpenCL provides, directive-based approaches may entail a significant performance penalty. To support massively parallel computations, accelerators such as GPGPUs offer an expansive set of registers, larger than even the L1 cache, to hold the temporary state of each thread. Scalar variables are the mostly likely candidates to be assigned to these registers by the compiler. Hence, scalar replacement is a key enabling optimization for effectively improving the utilization of register files on accelerator devices and thereby substantially reducing the cost of memory operations. However, the aggressive application of scalar replacement may require a large number of registers, limiting the application of this technique unless mitigating approaches such as those described in this paper are taken. In this paper, we propose solutions to optimize the register usage within offloaded computations using OpenACC directives. We first present a compiler optimization called SAFARA that extends the classical scalar replacement algorithm to improve register file utilization on GPUs. Moreover, we extend the OpenACC interface by providing new clauses, namely dim and small, that will reduce the number of scalars to replace. SAFARA prioritizes the most beneficial data for allocation in registers based on frequency of use and also memory access latency. It also uses a static feedback strategy to retrieve low-level register information in order to guide the compiler in carrying out the scalar replacement transformation. Then, the new clauses we propose will extremely reduce the number of scalars, eliminating the need for more registers. We evaluate SAFARA and the new clauses using SPEC and NAS OpenACC benchmarks, our results suggest that these approaches will be effective for improving overall performance of code executing on GPUs. We got up to 2.5 speedup running NAS and 2.08 speedup while running SPEC benchmarks.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127215328","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
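Scalar replacement happens inside the compiler on GPU kernels; the toy before/after below (plain Python, purely illustrative) only shows the source-level idea the abstract refers to: array references that are invariant in the inner loop are replaced by scalars, which a backend can then keep in registers instead of re-issuing memory operations. It does not reproduce SAFARA or the proposed dim/small clauses.

```python
# Before: a[i] and b[i] are re-read from the arrays on every inner iteration.
def update_rows_before(a, b, c, alpha):
    for i in range(len(a)):
        for j in range(len(c[i])):
            c[i][j] = alpha * a[i] + b[i] + c[i][j]

# After scalar replacement: the loop-invariant reads are hoisted into the
# scalars ai and bi, which a GPU compiler can keep in registers for the
# whole inner loop instead of re-issuing memory loads.
def update_rows_after(a, b, c, alpha):
    for i in range(len(a)):
        ai = a[i]
        bi = b[i]
        for j in range(len(c[i])):
            c[i][j] = alpha * ai + bi + c[i][j]
```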
Run-Time Performance Estimation and Fairness-Oriented Scheduling Policy for Concurrent GPGPU Applications
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.14
Qingda Hu, J. Shu, Jie Fan, Youyou Lu
{"title":"Run-Time Performance Estimation and Fairness-Oriented Scheduling Policy for Concurrent GPGPU Applications","authors":"Qingda Hu, J. Shu, Jie Fan, Youyou Lu","doi":"10.1109/ICPP.2016.14","DOIUrl":"https://doi.org/10.1109/ICPP.2016.14","url":null,"abstract":"In order to satisfy the competition of multiple GPU accelerated applications and make full use of GPU resources, a lot of previous works propose spatial-multitasking to execute multiple GPGPU applications simultaneously on a single GPU device. However, when adopting the spatial-multitasking framework, the inter-application interference may slow down different applications differently, leading to the unreasonable allocation of shared resources among concurrent GPGPU applications, degrading system fairness severely and resulting in sub-optimal performance. Thus, it is imperative to develop mechanisms to control negative inter-application interactions and utilize shared resources fairly and efficiently. Quantitatively estimating application slowdowns can enable us to accurately minimize system unfairness. Although several previous works pay attention on showdown estimation for CPUs, we find that they may be inaccurate for GPUs. Therefore, we propose a novel Dynamical Application Slowdown Estimation (DASE) model to estimate application slowdowns accurately. Our evaluations show that DASE has significantly lower estimation error (only 8.8%) than the state-of-the-art estimation models (36.3% and 32.8%) across all two-application workloads. Furthermore, to verify the effectiveness of our DASE model, we leverage our model to develop an efficient fairness-oriented Streaming Multiprocessors (SM) allocation policy DASE-Fair to minimize the overall system unfairness. Compared to the even SM partition policy, DASE-Fair improves fairness dramatically by more than 16.1% on average.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127043041","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
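The abstract does not give the DASE equations, so the sketch below is only a hedged illustration of how run-time slowdown estimates can drive a fairness-oriented SM split for two co-running applications. Slowdown is assumed to be shared-mode runtime divided by run-alone runtime, unfairness the ratio of the worst to the best slowdown, and the runtime models stand in for whatever an estimator like DASE would predict; all names and formulas here are assumptions, not the paper's model.

```python
def slowdown(t_shared, t_alone):
    # Assumed definition: how much slower an application runs when co-running.
    return t_shared / t_alone

def unfairness(slowdowns):
    # Assumed metric: ratio of the worst slowdown to the best one.
    return max(slowdowns) / min(slowdowns)

def fair_sm_split(total_sms, runtime_a, runtime_b):
    """Pick the SM partition (n_a, n_b) that minimizes unfairness.

    runtime_a, runtime_b: callables mapping an SM count to a predicted
    runtime; at run time these predictions are what a slowdown-estimation
    model would have to supply.
    """
    t_alone_a = runtime_a(total_sms)                    # run-alone baselines
    t_alone_b = runtime_b(total_sms)
    best = None
    for n_a in range(1, total_sms):
        n_b = total_sms - n_a
        u = unfairness([slowdown(runtime_a(n_a), t_alone_a),
                        slowdown(runtime_b(n_b), t_alone_b)])
        if best is None or u < best[0]:
            best = (u, n_a, n_b)
    return best  # (unfairness, SMs for application A, SMs for application B)
```

With perfectly scalable applications (runtime proportional to 1/SMs) this search simply returns the even split; the value of a run-time estimation model is precisely that real applications deviate from that idealization.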
The Case for Cross-Component Power Coordination on Power Bounded Systems
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.66
Rong Ge, Xizhou Feng, Yangyang He, Pengfei Zou
{"title":"The Case for Cross-Component Power Coordination on Power Bounded Systems","authors":"Rong Ge, Xizhou Feng, Yangyang He, Pengfei Zou","doi":"10.1109/ICPP.2016.66","DOIUrl":"https://doi.org/10.1109/ICPP.2016.66","url":null,"abstract":"Modern computer systems are increasingly bounded by the available or permissible power at multiple layers, ranging from a single chip to an entire data center. To cope with this reality, it is necessary to understand how power bounds impact the design and performance of emergent computer systems. In this paper, we study the problem of coordinated power allocation between processors and memory modules on power-bounded systems. We experimentally and analytically investigate the dynamics between cross-component power allocation and application performance, identify the patterns of power allocation scenarios, and develop optimal power allocation methods. In our study, we discover that (1) different applications share categorical patterns with regard to how power allocations among individual components impact application performance and actual power, (2) the per-node power budget must exceed a certain threshold in order to achieve desirable performance and efficiency, (3) there exist workload-specific optimal power allocations under a given power budget and such optimal power coordination can be pinpointed using the heuristics derived from the categorical patterns and a light-weight power-performance profiling. Results from this study demonstrate the importance and feasibility of cross-component coordination to the implementation of power-bound high performance computing technology.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129642378","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 19
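As a concrete, if simplified, picture of cross-component coordination (not the paper's heuristics), the sketch below enumerates processor/memory power splits under a node budget and keeps the split that a profiled performance model predicts to be fastest. The 5 W step, the cap ranges, and perf_model are illustrative assumptions.

```python
def best_power_split(node_budget_w, cpu_caps_w, mem_caps_w, perf_model, step_w=5):
    """Search processor/memory power allocations under a node power budget.

    cpu_caps_w, mem_caps_w: (min, max) permissible power caps in watts.
    perf_model(cpu_w, mem_w): predicted throughput under those caps, e.g.
    fitted from a lightweight power-performance profiling run.
    """
    best = None
    for cpu_w in range(cpu_caps_w[0], cpu_caps_w[1] + 1, step_w):
        mem_w = node_budget_w - cpu_w                   # rest of the budget
        if not (mem_caps_w[0] <= mem_w <= mem_caps_w[1]):
            continue                                    # infeasible split
        perf = perf_model(cpu_w, mem_w)
        if best is None or perf > best[0]:
            best = (perf, cpu_w, mem_w)
    return best  # (predicted throughput, CPU watts, memory watts)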
Randomly Optimized Grid Graph for Low-Latency Interconnection Networks
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.46
K. Nakano, Daisuke Takafuji, S. Fujita, Hiroki Matsutani, I. Fujiwara, M. Koibuchi
{"title":"Randomly Optimized Grid Graph for Low-Latency Interconnection Networks","authors":"K. Nakano, Daisuke Takafuji, S. Fujita, Hiroki Matsutani, I. Fujiwara, M. Koibuchi","doi":"10.1109/ICPP.2016.46","DOIUrl":"https://doi.org/10.1109/ICPP.2016.46","url":null,"abstract":"In this work we present randomly optimized grid graphs that maximize the performance measure, such as diameter and average shortest path length (ASPL), with subject to limited edge length on a grid surface. We also provide theoretical lower bounds of the diameter and the ASPL, which prove optimality of our randomly optimized grid graphs. We further present a diagonal grid layout that significantly reduces the diameter compared to the conventional one under the edge-length limitation. We finally show their applications to three case studies of off-and on-chip interconnection networks. Our design efficiently improves their performance measures, such as end-to-end communication latency, network power consumption, cost, and execution time of parallel benchmarks.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130633075","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 15
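A minimal rendition of the random-optimization idea, assuming NetworkX and a Manhattan edge-length limit on the grid surface: repeatedly rewire a random edge to a random length-feasible pair and keep the change only when it lowers the average shortest path length. The acceptance rule, length metric, and iteration budget are our simplifications; the paper's construction, lower bounds, and diagonal layout go well beyond this.

```python
import random
import networkx as nx

def randomly_optimize_grid(n, max_len, iters=2000, seed=0):
    """Rewire an n x n grid graph at random to reduce ASPL.

    Edges are only allowed between nodes whose Manhattan distance on the
    grid surface is at most max_len.
    """
    rnd = random.Random(seed)
    g = nx.grid_2d_graph(n, n)
    nodes = list(g.nodes())
    manhattan = lambda u, v: abs(u[0] - v[0]) + abs(u[1] - v[1])
    aspl = nx.average_shortest_path_length(g)
    for _ in range(iters):
        old = rnd.choice(list(g.edges()))
        u, v = rnd.choice(nodes), rnd.choice(nodes)
        if u == v or g.has_edge(u, v) or manhattan(u, v) > max_len:
            continue                                    # length-infeasible
        g.remove_edge(*old)
        g.add_edge(u, v)
        if nx.is_connected(g):
            new_aspl = nx.average_shortest_path_length(g)
            if new_aspl < aspl:
                aspl = new_aspl                         # keep improving rewire
                continue
        g.remove_edge(u, v)                             # otherwise revert
        g.add_edge(*old)
    return g, aspl, nx.diameter(g)
```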
Programming Techniques for the Automata Processor
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.30
Indranil Roy, Ankit Srivastava, S. Aluru
{"title":"Programming Techniques for the Automata Processor","authors":"Indranil Roy, Ankit Srivastava, S. Aluru","doi":"10.1109/ICPP.2016.30","DOIUrl":"https://doi.org/10.1109/ICPP.2016.30","url":null,"abstract":"The Micron Automata Processor (AP) is a novel co-processor accelerator that supports the parallel execution of multiple Nondeterministic Finite Automata (NFA) programmed directly into hardware over a single data-stream. In this paper, we present a number of programming techniques to develop automata that execute efficiently on this processor. First, we present general techniques to transform NFAs defined in their classical representation to the representation used by the AP, and optimize the same. Then, we present automata development techniques using simple but powerful generic building blocks. All the above techniques are generic in nature and can be useful to application developers working on this new upcoming co-processor architecture.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127624964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
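The AP executes state-labeled (homogeneous) automata: each state-transition element (STE) recognizes a symbol set, and every transition into it implicitly carries that label, so a classical transition-labeled NFA must first be homogenized. The sketch below shows that standard transformation over a simple tuple-based NFA encoding of our own choosing; it is not Micron's ANML/SDK API and omits the paper's optimization passes.

```python
def nfa_to_homogeneous(transitions, start, accepting):
    """Homogenize a transition-labeled NFA for a state-labeled target.

    transitions: iterable of (src, symbol, dst) triples.
    start:       the NFA start state.
    accepting:   set of accepting NFA states.

    Each new state ("STE") is a pair (dst, symbol): it matches `symbol`
    and represents having just entered dst on that symbol.
    """
    transitions = list(transitions)
    stes = {(dst, sym) for (_, sym, dst) in transitions}
    # Start-enabled STEs: transitions the NFA could take on the first symbol.
    start_stes = {(dst, sym) for (src, sym, dst) in transitions if src == start}
    # STE (q, a) activates STE (r, b) iff the NFA has a transition q -b-> r.
    activations = {(ste, (dst, sym))
                   for ste in stes
                   for (src, sym, dst) in transitions
                   if src == ste[0]}
    reporting = {ste for ste in stes if ste[0] in accepting}
    return stes, start_stes, activations, reporting

# Example: strings over {a, b} that end in "ab".
stes, starts, acts, reports = nfa_to_homogeneous(
    [("q0", "a", "q0"), ("q0", "b", "q0"), ("q0", "a", "q1"), ("q1", "b", "q2")],
    start="q0", accepting={"q2"})
```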
One-Sided Interface for Matrix Operations Using MPI-3 RMA: A Case Study with Elemental
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.28
Sayan Ghosh, J. Hammond, Antonio J. Peña, P. Balaji, A. Gebremedhin, B. Chapman
{"title":"One-Sided Interface for Matrix Operations Using MPI-3 RMA: A Case Study with Elemental","authors":"Sayan Ghosh, J. Hammond, Antonio J. Peña, P. Balaji, A. Gebremedhin, B. Chapman","doi":"10.1109/ICPP.2016.28","DOIUrl":"https://doi.org/10.1109/ICPP.2016.28","url":null,"abstract":"A one-sided programming model separates communication from synchronization, and is the driving principle behind partitioned global address space (PGAS) libraries such as Global Arrays (GA) and SHMEM. PGAS models expose a rich set of functionality that a developer needs in order to implement mathematical algorithms that require frequent multidimensional array accesses. However, use of existing PGAS libraries in application codes often requires significant development effort in order to fully exploit these programming models. On the other hand, a vast majority of scientific codes use MPI either directly or indirectly via third-party scientific computation libraries, and need features to support application-specific communication requirements (e.g., asynchronous update of distributed sparse matrices, commonly arising in machine learning workloads). For such codes it is often impractical to completely shift programming models in favor of special one-sided communication middleware. Instead, an elegant and productive solution is to exploit the one-sided functionality already offered by MPI-3 RMA (Remote Memory Access). We designed a general one-sided interface using the MPI-3 passive RMA model for remote matrix operations in the linear algebra library Elemental, we call the interface we designed RMAInterface. Elemental is an open source library for distributed-memory dense and sparse linear algebra and optimization. We employ RMAInterface to construct a Global Arrays-like API and demonstrate its performance scalability and competitivity with that of the existing GA (with ARMCI-MPI) for a quantum chemistry application.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126322389","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
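RMAInterface itself is a C++ layer inside Elemental, but the underlying MPI-3 passive-target pattern it builds on can be sketched with mpi4py (assumed available): each rank exposes its local matrix block through a window, and a peer updates a remote block inside a lock/accumulate/unlock epoch without the owner's participation. The block size and the ring-style target choice are illustrative.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Each rank owns one 4x4 block of a distributed matrix and exposes it
# through an RMA window (16 doubles).
local_block = np.zeros((4, 4), dtype="d")
win = MPI.Win.Create(local_block, comm=comm)

# Passive-target update: add a contribution into the next rank's block
# without that rank participating in the communication.
peer = (rank + 1) % size
contribution = np.full(16, float(rank), dtype="d")

win.Lock(peer, MPI.LOCK_SHARED)
win.Accumulate(contribution, peer, target=[0, 16, MPI.DOUBLE], op=MPI.SUM)
win.Unlock(peer)                      # completes the update at the target

comm.Barrier()                        # make sure every rank's update landed
if rank == 0:
    print(local_block)
win.Free()
```

Accumulate with MPI.SUM is elementwise atomic, which is why a shared lock suffices even when several ranks target the same block concurrently.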
A Parallel Hill-Climbing Refinement Algorithm for Graph Partitioning
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.34
Dominique LaSalle, G. Karypis
{"title":"A Parallel Hill-Climbing Refinement Algorithm for Graph Partitioning","authors":"Dominique LaSalle, G. Karypis","doi":"10.1109/ICPP.2016.34","DOIUrl":"https://doi.org/10.1109/ICPP.2016.34","url":null,"abstract":"Graph partitioning is important in distributing workloads on parallel compute systems, computing sparse matrix re-orderings, and designing VLSI circuits. Refinement algorithms are used to improve existing partitionings, and are essential for obtaining high-quality partitionings. Existing parallel refinement algorithms either extract concurrency by sacrificing in terms of quality, or preserve quality by restricting concurrency. In this work we present a new shared-memory parallel algorithm for refining an existing k-way partitioning that can break out of local minima and produce high-quality partitionings. This allows our algorithm to scale well in terms of the number of processing cores and produce clusterings of quality equal to serial algorithms. Our algorithm achieves speedups of 5.7 - 16.7× using 24 cores, while exhibiting only 0.52% higher edgecuts than when run serially. This is 6.3× faster and 1.9% better quality than other parallel refinement algorithms which can break out of local minima.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130439792","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 40
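Not the paper's parallel hill-climbing algorithm, but a minimal serial sketch of the gain-based boundary refinement such algorithms build on: a vertex moves to the partition where it has more neighbors whenever the move lowers the edgecut and respects a balance bound. Because it only accepts strictly improving moves, this baseline gets stuck in exactly the local minima the paper's hill-climbing is designed to escape; names and the imbalance tolerance are our own.

```python
from collections import defaultdict

def greedy_refine(adj, part, k, max_imbalance=1.1, max_passes=10):
    """Greedy edgecut refinement of an existing k-way partitioning.

    adj:  dict mapping each vertex to its neighbors (undirected, unweighted).
    part: dict mapping each vertex to a partition id in [0, k); updated in place.
    """
    cap = len(adj) / k * max_imbalance                  # balance bound per part
    sizes = defaultdict(int)
    for v in adj:
        sizes[part[v]] += 1
    for _ in range(max_passes):
        moved = False
        for v in adj:
            home = part[v]
            conn = defaultdict(int)                     # neighbors per partition
            for u in adj[v]:
                conn[part[u]] += 1
            internal = conn.get(home, 0)                # neighbors kept at home
            best, best_gain = home, 0
            for p, c in conn.items():
                gain = c - internal                     # edgecut reduction
                if p != home and gain > best_gain and sizes[p] + 1 <= cap:
                    best, best_gain = p, gain
            if best != home:
                sizes[home] -= 1
                sizes[best] += 1
                part[v] = best
                moved = True
        if not moved:
            break                                       # local minimum reached
    return part
```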
An Unbounded Nonblocking Double-Ended Queue
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.32
Matthew Graichen, Joseph Izraelevitz, M. Scott
{"title":"An Unbounded Nonblocking Double-Ended Queue","authors":"Matthew Graichen, Joseph Izraelevitz, M. Scott","doi":"10.1109/ICPP.2016.32","DOIUrl":"https://doi.org/10.1109/ICPP.2016.32","url":null,"abstract":"We introduce a new algorithm for an unbounded concurrent double-ended queue (deque). Like the bounded deque of Herlihy, Luchangco, and Moir on which it is based, the new algorithm is simple and obstruction free, has no pathological long-latency scenarios, avoids interference between operations at opposite ends, and requires no special hardware support beyond the usual compare-and-swap. To the best of our knowledge, no prior concurrent deque combines these properties with unbounded capacity, or provides consistently better performance across a wide range of concurrent workloads.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133623689","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Optimal Multi-taxi Dispatch for Mobile Taxi-Hailing Systems
2016 45th International Conference on Parallel Processing (ICPP) Pub Date : 2016-08-01 DOI: 10.1109/ICPP.2016.41
Guoju Gao, Mingjun Xiao, Zhenhua Zhao
{"title":"Optimal Multi-taxi Dispatch for Mobile Taxi-Hailing Systems","authors":"Guoju Gao, Mingjun Xiao, Zhenhua Zhao","doi":"10.1109/ICPP.2016.41","DOIUrl":"https://doi.org/10.1109/ICPP.2016.41","url":null,"abstract":"Traditional taxi-hailing systems through wireless networks in metropolitan areas allow taxis to compete for passengers chaotically and accidentally, which generally result in inefficiencies, long waiting time and low satisfaction of taxi-hailing passengers. In this paper, we propose a new Mobile Taxi-hailing System (called MTS) based on optimal multi-taxi dispatch, which can be used by taxi service companies (TSCs). Different from the competition modes used in traditional taxi-hailing systems, MTS assigns vacant taxis to taxi-hailing passengers proactively. For the taxi dispatch problem in MTS, we define a system utility function, which involves the total net profits of taxis and waiting time of passengers. Moreover, in the utility function, we take into consideration the various classes of taxis with different resource configurations, and the cost associated with taxis' empty travel distances. Our goal is to maximize the system utility function, restricted by the individual net profits of taxis and the passengers' requirements for specified classes of taxis. To solve this problem, we design an optimal algorithm based on the idea of Kuhn-Munkres (called KMBA), and prove the correctness and optimality of the proposed algorithm. Additionally, we demonstrate the significant performances of our algorithm through extensive simulations.","PeriodicalId":409991,"journal":{"name":"2016 45th International Conference on Parallel Processing (ICPP)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133114546","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26
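The paper's objective (net profit, taxi classes, empty-travel cost) is richer than this, but the assignment core it builds on, the Kuhn-Munkres algorithm, can be sketched with SciPy's Hungarian-method solver over an illustrative taxi-to-passenger pickup-distance cost matrix; the Euclidean cost and the example coordinates are assumptions for demonstration only.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def dispatch(taxi_pos, passenger_pos):
    """Assign taxis to passengers with minimum total pickup distance.

    taxi_pos:      (m, 2) array of vacant taxi coordinates.
    passenger_pos: (n, 2) array of waiting passenger coordinates.
    Returns a list of (taxi_index, passenger_index) pairs.
    """
    # Cost matrix: Euclidean pickup distance from each taxi to each passenger.
    cost = np.linalg.norm(taxi_pos[:, None, :] - passenger_pos[None, :, :], axis=2)
    rows, cols = linear_sum_assignment(cost)            # Kuhn-Munkres
    return list(zip(rows.tolist(), cols.tolist()))

# Example: three vacant taxis, two waiting passengers.
pairs = dispatch(np.array([[0.0, 0.0], [5.0, 5.0], [1.0, 4.0]]),
                 np.array([[0.5, 0.5], [4.0, 4.0]]))
```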