2014 IEEE 28th International Parallel and Distributed Processing Symposium最新文献_第2页

An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data 非规则数据的高效GPU通用稀疏矩阵-矩阵乘法

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.47

Weifeng Liu, B. Vinter

{"title":"An Efficient GPU General Sparse Matrix-Matrix Multiplication for Irregular Data","authors":"Weifeng Liu, B. Vinter","doi":"10.1109/IPDPS.2014.47","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.47","url":null,"abstract":"General sparse matrix-matrix multiplication (SpGEMM) is a fundamental building block for numerous applications such as algebraic multigrid method, breadth first search and shortest path problem. Compared to other sparse BLAS routines, an efficient parallel SpGEMM algorithm has to handle extra irregularity from three aspects: (1) the number of the nonzero entries in the result sparse matrix is unknown in advance, (2) very expensive parallel insert operations at random positions in the result sparse matrix dominate the execution time, and (3) load balancing must account for sparse data in both input matrices. Recent work on GPU SpGEMM has demonstrated rather good both time and space complexity, but works best for fairly regular matrices. In this work we present a GPU SpGEMM algorithm that particularly focuses on the above three problems. Memory pre-allocation for the result matrix is organized by a hybrid method that saves a large amount of global memory space and efficiently utilizes the very limited on-chip scratchpad memory. Parallel insert operations of the nonzero entries are implemented through the GPU merge path algorithm that is experimentally found to be the fastest GPU merge approach. Load balancing builds on the number of the necessary arithmetic operations on the nonzero entries and is guaranteed in all stages. Compared with the state-of-the-art GPU SpGEMM methods in the CUSPARSE library and the CUSP library and the latest CPU SpGEMM method in the Intel Math Kernel Library, our approach delivers excellent absolute performance and relative speedups on a benchmark suite composed of 23 matrices with diverse sparsity structures.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127402338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 94

PAGE: A Framework for Easy PArallelization of GEnomic Applications PAGE:基因组应用的简单并行化框架

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.19

Mucahid Kutlu, G. Agrawal

{"title":"PAGE: A Framework for Easy PArallelization of GEnomic Applications","authors":"Mucahid Kutlu, G. Agrawal","doi":"10.1109/IPDPS.2014.19","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.19","url":null,"abstract":"With the availability of high-throughput and low-cost sequencing technologies, an increasing amount of genetic data is becoming available to researchers. There is clearly a potential for significant new scientific and medical advances by analysis of such data, however, it is imperative to exploit parallelism and achieve effective utilization of the computing resources to be able to handle massive datasets. Thus, frameworks that can help researchers develop parallel applications without dealing with low-level details of parallel coding are very important for advances in genetic research. In this study, we develop a middleware, PAGE, which supports 'map reduce-like' processing, but with significant differences from a system like Hadoop, to be useful and effective for parallelizing analysis of genomic data. Particularly, it can work with map functions written in any language, thus allowing utilization of existing serial tools (even those for which only an executable is available) as map functions. Thus, it can greatly simplify parallel application development for scenarios where complex data formats and/or nuanced serial algorithms are involved, as is often the case for genomic data. It allows parallelization by partitioning by-locus or partitioning by-chromosome, provides different scheduling schemes, and execution models, to match the nature of algorithms common in genetic research. We have evaluated the middleware system using four popular genomic applications, including VarScan, Unified Genotyper, Realigner Target Creator, and Indel Realigner, and compared the achieved performance against with two popular frameworks (Hadoop and GATK). We show that our middleware outperforms GATK and Hadoop and it is able to achieve high parallel efficiency and scalability.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114811195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 4

A Spatio-temporal Coupling Method to Reduce the Time-to-Solution of Cardiovascular Simulations 一种缩短心血管仿真求解时间的时空耦合方法

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.68

A. Randles, E. Kaxiras

引用次数: 5

F2C2-STM: Flux-Based Feedback-Driven Concurrency Control for STMs F2C2-STM:基于通量的反馈驱动stm并发控制

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.99

K. Ravichandran, S. Pande

{"title":"F2C2-STM: Flux-Based Feedback-Driven Concurrency Control for STMs","authors":"K. Ravichandran, S. Pande","doi":"10.1109/IPDPS.2014.99","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.99","url":null,"abstract":"Software Transactional Memory (STM) systems provide an easy to use programming model for concurrent code and have been found suitable for parallelizing many applications providing performance gains with minimal programmer effort. With increasing core counts on modern processors one would expect increasing benefits. However, we observe that running STM applications on higher core counts is sometimes, in fact, detrimental to performance. This is due to the larger number of conflicts that arise with a larger number of parallel cores. As the number of cores available on processors steadily rise, a larger number of applications are beginning to exhibit these characteristics. In this paper we propose a novel dynamic concurrency control technique which can significantly improve performance (up to 50%) as well as resource utilization (up to 85%) for these applications at higher core counts. Our technique uses ideas borrowed from TCP's network congestion control algorithm and uses self-induced concurrency fluctuations to dynamically monitor and match varying concurrency levels in applications while minimizing global synchronization. Our flux-based feedback-driven concurrency control technique is capable of fully recovering the performance of the best statically chosen concurrency specification (as chosen by an oracle) regardless of the initial specification for several real world applications. Further, our technique can actually improve upon the performance of the oracle chosen specification by more than 10% for certain applications through dynamic adaptation to available parallelism. We demonstrate our approach on the STAMP benchmark suite while reporting significant performance and resource utilization benefits. We also demonstrate significantly better performance when comparing against state of the art concurrency control and scheduling techniques. Further, our technique is programmer friendly as it requires no changes to application code and no offline phases.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117043505","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 17

Balancing CPU-GPU Collaborative High-Order CFD Simulations on the Tianhe-1A Supercomputer 在天河1a超级计算机上平衡CPU-GPU协同高阶CFD仿真

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.80

Chuanfu Xu, Lilun Zhang, Xiaogang Deng, Jianbin Fang, Guang-Xiong Wang, Wei Cao, Yonggang Che, Yongxian Wang, Wei Liu

{"title":"Balancing CPU-GPU Collaborative High-Order CFD Simulations on the Tianhe-1A Supercomputer","authors":"Chuanfu Xu, Lilun Zhang, Xiaogang Deng, Jianbin Fang, Guang-Xiong Wang, Wei Cao, Yonggang Che, Yongxian Wang, Wei Liu","doi":"10.1109/IPDPS.2014.80","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.80","url":null,"abstract":"HOSTA is an in-house high-order CFD software that can simulate complex flows with complex geometries. Large scale high-order CFD simulations using HOSTA require massive HPC resources, thus motivating us to port it onto modern GPU accelerated supercomputers like Tianhe-1A. To achieve a greater speedup and fully tap the potential of Tianhe-1A, we collaborate CPU and GPU for HOSTA instead of using a naive GPU-only approach. We present multiple novel techniques to balance the loads between the store-poor GPU and the store-rich CPU, and overlap the collaborative computation and communication as far as possible. Taking CPU and GPU load balance into account, we improve the maximum simulation problem size per Tianhe-1A node for HOSTA by 2.3X, meanwhile the collaborative approach can improve the performance by around 45% compared to the GPU-only approach. Scalability tests show that HOSTA can achieve a parallel efficiency of above 60% on 1024 Tianhe-1A nodes. With our method, we have successfully simulated China's large civil airplane configuration C919 containing 150M grid cells. To our best knowledge, this is the first paper that reports a CPUGPU collaborative high-order accurate aerodynamic simulation result with such a complex grid geometry.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"108 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117282724","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 9

A Case for a Flexible Scalar Unit in SIMT Architecture SIMT体系结构中柔性标量单元的一种情况

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.21

Yi Yang, Ping Xiang, Mike Mantor, Norman Rubin, Lisa R. Hsu, Qunfeng Dong, Huiyang Zhou

{"title":"A Case for a Flexible Scalar Unit in SIMT Architecture","authors":"Yi Yang, Ping Xiang, Mike Mantor, Norman Rubin, Lisa R. Hsu, Qunfeng Dong, Huiyang Zhou","doi":"10.1109/IPDPS.2014.21","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.21","url":null,"abstract":"The wide availability and the Single-Instruction Multiple-Thread (SIMT)-style programming model have made graphics processing units (GPUs) a promising choice for high performance computing. However, because of the SIMT style processing, an instruction will be executed in every thread even if the operands are identical for all the threads. To overcome this inefficiency, the AMD's latest Graphics Core Next (GCN) architecture integrates a scalar unit into a SIMT unit. In GCN, both the SIMT unit and the scalar unit share a single SIMT style instruction stream. Depending on its type, an instruction is issued to either a scalar or a SIMT unit. In this paper, we propose to extend the scalar unit so that it can either share the instruction stream with the SIMT unit or execute a separate instruction stream. The program to be executed by the scalar unit is referred to as a scalar program and its purpose is to assist SIMT-unit execution. The scalar programs are either generated from SIMT programs automatically by the compiler or manually developed by expert developers. We make a case for our proposed flexible scalar unit through three collaborative execution paradigms: data prefetching, control divergence elimination, and scalar-workload extraction. Our experimental results show that significant performance gains can be achieved using our proposed approaches compared to the state-of-art SIMT style processing.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128606805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 19

MobiStreams: A Reliable Distributed Stream Processing System for Mobile Devices 移动流:一个可靠的分布式流处理系统的移动设备

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.17

Huayong Wang, L. Peh

{"title":"MobiStreams: A Reliable Distributed Stream Processing System for Mobile Devices","authors":"Huayong Wang, L. Peh","doi":"10.1109/IPDPS.2014.17","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.17","url":null,"abstract":"Multi-core phones are now pervasive. Yet, existing applications rely predominantly on a client-server computing paradigm, using phones only as thin clients, sending sensed information via the cellular network to servers for processing. This makes the cellular network the bottleneck, limiting overall application performance. In this paper, we propose Mobi Streams, a Distributed Stream Processing System (DSPS) that runs directly on smartphones. Mobi Streams can offload computing from remote servers to local phones and thus alleviate the pressure on the cellular network. Implementing DSPS on smartphones faces significant challenges: 1) multiple phones can readily fail simultaneously, and 2) the phones' ad-hoc WiFi network has low bandwidth. Mobi Streams tackles these challenges through two new techniques: 1) token-triggered check pointing, and 2) broadcast-based check pointing. Our evaluations driven by two real world applications deployed in the US and Singapore show that migrating from a server platform to a smartphone platform eliminates the cellular network bottleneck, leading to 0.78~42.6X throughput increase and 10%~94.8% latency decrease. Also, Mobi Streams' fault tolerance scheme increases throughput by 230% and reduces latency by 40% vs. prior state-of-the-art fault-tolerant DSPSs.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130873994","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 27

Identifying Code Phases Using Piece-Wise Linear Regressions 使用分段线性回归识别代码阶段

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.100

Harald Servat, Germán Llort, Juan Gonzalez, Judit Giménez, Jesús Labarta

{"title":"Identifying Code Phases Using Piece-Wise Linear Regressions","authors":"Harald Servat, Germán Llort, Juan Gonzalez, Judit Giménez, Jesús Labarta","doi":"10.1109/IPDPS.2014.100","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.100","url":null,"abstract":"Node-level performance is one of the factors that may limit applications from reaching the supercomputers' peak performance. Studying node-level performance and attributing it to the source code results into valuable insight that can be used to improve the application efficiency, albeit performing such a study may be an intimidating task due to the complexity and size of the applications. We present in this paper a mechanism that takes advantage of combining piece-wise linear regressions, coarse-grain sampling, and minimal instrumentation to detect performance phases in the computation regions even if their granularity is very fine. This mechanism then maps the performance of each phase into the application syntactical structure displaying a correlation between performance and source code. We introduce a methodology on top of this mechanism to describe the node-level performance of parallel applications, even for first-time seen applications. Finally, we demonstrate the methodology describing optimized in-production applications and further improving their performance applying small transformations to the code based on the hints discovered.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"41 2","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121006223","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 3

Victim Selection and Distributed Work Stealing Performance: A Case Study 受害者选择和分布式工作盗窃行为:一个案例研究

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.74

Swann Perarnau, M. Sato

{"title":"Victim Selection and Distributed Work Stealing Performance: A Case Study","authors":"Swann Perarnau, M. Sato","doi":"10.1109/IPDPS.2014.74","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.74","url":null,"abstract":"Work stealing is a popular solution to perform dynamic load balancing of irregular computations, both for shared memory and distributed memory systems. While shared memory performance of work stealing is well understood, distributing this algorithm to several thousands of nodes can introduce new performance issues. In particular, most studies of work stealing assume that all participating processes are equidistant from each other, in terms of communication latency. This paper presents a new performance evaluation of the popular UTS benchmark, in its work stealing implementation, on the scale of ten thousands of compute nodes. Taking advantage of the physical scale of the K Computer, we investigate in details the performance impact of communication latencies on work stealing. In particular, we introduce a new performance metric to assess the time needed by the work stealing scheduler to distribute work among all processes. Using this metric, we identify a previously overlooked issue: the victim selection function used by the work stealing application can severely impact its performance at large scale. To solve this issue, we introduce a new strategy taking into account the physical distance between nodes and achieve significant performance improvements.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125094273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 17

Improved Time Bounds for Linearizable Implementations of Abstract Data Types 抽象数据类型线性化实现的改进时间界限

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.77

Jiaqi Wang, Edward Talmage, Hyunyoung Lee, J. Welch

引用次数: 9