2014 IEEE 28th International Parallel and Distributed Processing Symposium最新文献_第4页

Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations 核构型相互作用计算的稀疏矩阵-多向量乘法优化

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.125

H. Aktulga, A. Buluç, Samuel Williams, Chao Yang

{"title":"Optimizing Sparse Matrix-Multiple Vectors Multiplication for Nuclear Configuration Interaction Calculations","authors":"H. Aktulga, A. Buluç, Samuel Williams, Chao Yang","doi":"10.1109/IPDPS.2014.125","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.125","url":null,"abstract":"Obtaining highly accurate predictions on the properties of light atomic nuclei using the configuration interaction (CI) approach requires computing a few extremal Eigen pairs of the many-body nuclear Hamiltonian matrix. In the Many-body Fermion Dynamics for nuclei (MFDn) code, a block Eigen solver is used for this purpose. Due to the large size of the sparse matrices involved, a significant fraction of the time spent on the Eigen value computations is associated with the multiplication of a sparse matrix (and the transpose of that matrix) with multiple vectors (SpMM and SpMM_T). Existing implementations of SpMM and SpMM_T significantly underperform expectations. Thus, in this paper, we present and analyze optimized implementations of SpMM and SpMM_T. We base our implementation on the compressed sparse blocks (CSB) matrix format and target systems with multi-core architectures. We develop a performance model that allows us to understand and estimate the performance characteristics of our SpMM kernel implementations, and demonstrate the efficiency of our implementation on a series of real-world matrices extracted from MFDn. In particular, we obtain 3-4 speedup on the requisite operations over good implementations based on the commonly used compressed sparse row (CSR) matrix format. The improvements in the SpMM kernel suggest we may attain roughly a 40% speed up in the overall execution time of the block Eigen solver used in MFDn.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"142 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116187419","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 75

Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters 改进Intel Xeon Phi协处理器集群上本地应用程序的通信性能和可扩展性

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.113

K. Vaidyanathan, K. Pamnany, Dhiraj D. Kalamkar, A. Heinecke, M. Smelyanskiy, Jongsoo Park, Daehyun Kim, Aniruddha G. Shet, Bharat Kaul, B. Joó, P. Dubey

{"title":"Improving Communication Performance and Scalability of Native Applications on Intel Xeon Phi Coprocessor Clusters","authors":"K. Vaidyanathan, K. Pamnany, Dhiraj D. Kalamkar, A. Heinecke, M. Smelyanskiy, Jongsoo Park, Daehyun Kim, Aniruddha G. Shet, Bharat Kaul, B. Joó, P. Dubey","doi":"10.1109/IPDPS.2014.113","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.113","url":null,"abstract":"Intel Xeon Phi coprocessor-based clusters offer high compute and memory performance for parallel workloads and also support direct network access. Many real world applications are significantly impacted by network characteristics and to maximize the performance of such applications on these clusters, it is particularly important to effectively saturate network bandwidth and/or hide communications latency. We demonstrate how to do so using techniques such as pipelined DMAs for data transfer, dynamic chunk sizing, and better asynchronous progress. We also show a method for, and the impact of avoiding serialization and maximizing parallelism during application communication phases. Additionally, we apply application optimizations focused on balancing computation and communication in order to hide communication latency and improve utilization of cores and of network bandwidth. We demonstrate the impact of our techniques on three well known and highly optimized HPC kernels running natively on the Intel Xeon Phi coprocessor. For the Wilson-Dslash operator from Lattice QCD, we characterize the improvements from each of our optimizations for communication performance, apply our method for maximizing concurrency during communication phases, and show an overall 48% improvement from our previously best published result. For HPL/LINPACK, we show 68.5% efficiency with 97 TFLOPs on 128 Intel Xeon Phi coprocessors, the first ever reported native HPL efficiency on a coprocessor-based supercomputer. For FFT, we show 10.8 TFLOPs using 1024 Intel Xeon Phi coprocessors on the TACC Stampede cluster, the highest reported performance on any Intel Architecture-based cluster and the first such result to be reported on a coprocessor-based supercomputer.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114858771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 17

CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination 通过跨应用程序协调减少高性能计算系统中的I/O干扰

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.27

Matthieu Dorier, Gabriel Antoniu, R. Ross, D. Kimpe, Shadi Ibrahim

{"title":"CALCioM: Mitigating I/O Interference in HPC Systems through Cross-Application Coordination","authors":"Matthieu Dorier, Gabriel Antoniu, R. Ross, D. Kimpe, Shadi Ibrahim","doi":"10.1109/IPDPS.2014.27","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.27","url":null,"abstract":"Unmatched computation and storage performance in new HPC systems have led to a plethora of I/O optimizations ranging from application-side collective I/O to network and disk-level request scheduling on the file system side. As we deal with ever larger machines, the interference produced by multiple applications accessing a shared parallel file system in a concurrent manner becomes a major problem. Interference often breaks single-application I/O optimizations, dramatically degrading application I/O performance and, as a result, lowering machine wide efficiency. This paper focuses on CALCioM, a framework that aims to mitigate I/O interference through the dynamic selection of appropriate scheduling policies. CALCioM allows several applications running on a supercomputer to communicate and coordinate their I/O strategy in order to avoid interfering with one another. In this work, we examine four I/O strategies that can be accommodated in this framework: serializing, interrupting, interfering and coordinating. Experiments on Argonne's BG/P Surveyor machine and on several clusters of the French Grid'5000 show how CALCioM can be used to efficiently and transparently improve the scheduling strategy between two otherwise interfering applications, given specified metrics of machine wide efficiency.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121895244","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 110

Active Measurement of the Impact of Network Switch Utilization on Application Performance 网络交换机利用率对应用性能影响的主动测量

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.28

Marc Casas, G. Bronevetsky

{"title":"Active Measurement of the Impact of Network Switch Utilization on Application Performance","authors":"Marc Casas, G. Bronevetsky","doi":"10.1109/IPDPS.2014.28","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.28","url":null,"abstract":"Inter-node networks are a key capability of High-Performance Computing (HPC) systems that differentiates them from less capable classes of machines. However, in spite of their very high performance, the increasing computational power of HPC compute nodes and the associated rise in application communication needs make network performance a common performance bottleneck. To achieve high performance in spite of network limitations application developers require tools to measure their applications' network utilization and inform them about how the network's communication capacity relates to the performance of their applications. This paper presents a new performance measurement and analysis methodology based on empirical measurements of network behavior. Our approach uses two benchmarks that inject extra network communication. The first probes the fraction of the network that is utilized by a software component (an application or an individual task) to determine the existence and severity of network contention. The second aggressively injects network traffic while a software component runs to evaluate its performance on less capable networks or when it shares the network with other software components. We then combine the information from the two types of experiments to predict the performance slowdown experienced by multiple software components (e.g. multiple processes of a single MPI application) when they share a single network. Our methodology is applied to individual network switches and demonstrated taking 6 representative HPC applications and predicting the performance slowdowns of the 36 possible application pairs. The average error of our predictions is less than 10%.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"154 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121506055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 10

Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths 单源最短路径的高效并行GPU方法

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.45

A. Davidson, S. Baxter, M. Garland, John Douglas Owens

{"title":"Work-Efficient Parallel GPU Methods for Single-Source Shortest Paths","authors":"A. Davidson, S. Baxter, M. Garland, John Douglas Owens","doi":"10.1109/IPDPS.2014.45","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.45","url":null,"abstract":"Finding the shortest paths from a single source to all other vertices is a fundamental method used in a variety of higher-level graph algorithms. We present three parallel friendly and work-efficient methods to solve this Single-Source Shortest Paths (SSSP) problem: Work front Sweep, Near-Far and Bucketing. These methods choose different approaches to balance the trade off between saving work and organizational overhead. In practice, all of these methods do much less work than traditional Bellman-Ford methods, while adding only a modest amount of extra work over serial methods. These methods are designed to have a sufficient parallel workload to fill modern massively-parallel machines, and select reorganizational schemes that map well to these architectures. We show that in general our Near-Far method has the highest performance on modern GPUs, outperforming other parallel methods. We also explore a variety of parallel load-balanced graph traversal strategies and apply them towards our SSSP solver. Our work-saving methods always outperform a traditional GPU Bellman-Ford implementation, achieving rates up to 14x higher on low-degree graphs and 340x higher on scale free graphs. We also see significant speedups (20-60x) when compared against a serial implementation on graphs with adequately high degree.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125688699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 176

A Step towards Energy Efficient Computing: Redesigning a Hydrodynamic Application on CPU-GPU 迈向节能计算的一步:在CPU-GPU上重新设计流体力学应用程序

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.103

Tingxing Dong, V. Dobrev, T. Kolev, R. Rieben, S. Tomov, J. Dongarra

{"title":"A Step towards Energy Efficient Computing: Redesigning a Hydrodynamic Application on CPU-GPU","authors":"Tingxing Dong, V. Dobrev, T. Kolev, R. Rieben, S. Tomov, J. Dongarra","doi":"10.1109/IPDPS.2014.103","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.103","url":null,"abstract":"Power and energy consumption are becoming an increasing concern in high performance computing. Compared to multi-core CPUs, GPUs have a much better performance per watt. In this paper we discuss efforts to redesign the most computation intensive parts of BLAST, an application that solves the equations for compressible hydrodynamics with high order finite elements, using GPUs BLAST, Dobrev. In order to exploit the hardware parallelism of GPUs and achieve high performance, we implemented custom linear algebra kernels. We intensively optimized our CUDA kernels by exploiting the memory hierarchy, which exceed the vendor's library routines substantially in performance. We proposed an auto tuning technique to adapt our CUDA kernels to the orders of the finite element method. Compared to a previous base implementation, our redesign and optimization lowered the energy consumption of the GPU in two aspects: 60% less time to solution and 10% less power required. Compared to the CPU-only solution, our GPU accelerated BLAST obtained a 2.5× overall speedup and 1.42× energy efficiency (green up) using 4th order (Q_4) finite elements, and a 1.9× speedup and 1.27× green up using 2nd order (Q2) finite elements.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"79 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133930004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 60

A Multi-core Parallel Branch-and-Bound Algorithm Using Factorial Number System 基于阶乘数系统的多核并行分支定界算法

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.124

M. Mezmaz, Rudi Leroy, N. Melab, D. Tuyttens

{"title":"A Multi-core Parallel Branch-and-Bound Algorithm Using Factorial Number System","authors":"M. Mezmaz, Rudi Leroy, N. Melab, D. Tuyttens","doi":"10.1109/IPDPS.2014.124","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.124","url":null,"abstract":"Many real-world problems in different industrial and economic fields are permutation combinatorial optimization problems. Solving to optimality large instances of these problems, such as flowshop problem, is a challenge for multi-core computing. This paper proposes a multi-threaded factoradic-based branch-and-bound algorithm to solve permutation combinatorial problems on multi-core processors. The factoradic, called also factorial number system, is a mixed radix numeral system adapted to numbering permutations. In this new parallel algorithm, the B&B is based on a matrix of integers instead of a pool of permutations, and work units exchanged between threads are intervals of factoradics instead of sets of nodes. Compared to a conventional pool-based approach, the obtained results on flowshop instances demonstrate that our new factoradic-based approach, on average, uses about 60 times less memory to store the pool of subproblems, generates about 1.3 times less page faults, waits about 7 times less time to synchronize the access to the pool, requires about 9 times less CPU time to manage this pool, and performs about 30,000 times less context switches.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"185 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133032014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 17

Pythia: Faster Big Data in Motion through Predictive Software-Defined Network Optimization at Runtime Pythia:在运行时通过预测性软件定义的网络优化实现更快的动态大数据

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.20

M. V. Neves, C. Rose, K. Katrinis, H. Franke

{"title":"Pythia: Faster Big Data in Motion through Predictive Software-Defined Network Optimization at Runtime","authors":"M. V. Neves, C. Rose, K. Katrinis, H. Franke","doi":"10.1109/IPDPS.2014.20","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.20","url":null,"abstract":"The rise of Internet of Things sensors, social networking and mobile devices has led to an explosion of available data. Gaining insights into this data has led to the area of Big Data analytics. The MapReduce framework, as implemented in Hadoop, is one of the most popular frameworks for Big Data analysis. To handle the ever-increasing data size, Hadoop is a scalable framework that allows dedicated, seemingly unbound numbers of servers to participate in the analytics process. Response time of an analytics request is an important factor for time to value/insights. While the compute and disk I/O requirements can be scaled with the number of servers, scaling the system leads to increased network traffic. Arguably, the communication-heavy phase of MapReduce contributes significantly to the overall response time, the problem is further aggravated, if communication patterns are heavily skewed, as is not uncommon in many MapReduce workloads. In this paper we present a system that reduces the skew impact by transparently predicting data communication volume at runtime and mapping the many end-to-end flows among the various processes to the underlying network, using emerging software-defined networking technologies to avoid hotspots in the network. Dependent on the network oversubscription ratio, we demonstrate reduction in job completion time between 3% and 46% for popular MapReduce benchmarks like Sort and Nutch.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133203545","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 38

Effectively Exploiting Parallel Scale for All Problem Sizes in LU Factorization 在LU分解中有效地利用了所有问题规模的并行尺度

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.109

Md Rakib Hasan, R. C. Whaley

{"title":"Effectively Exploiting Parallel Scale for All Problem Sizes in LU Factorization","authors":"Md Rakib Hasan, R. C. Whaley","doi":"10.1109/IPDPS.2014.109","DOIUrl":"https://doi.org/10.1109/IPDPS.2014.109","url":null,"abstract":"LU factorization is one of the most widely-used methods for solving linear equations, and thus its performance underlies a broad range of scientific computing. As architectural trends have replaced clock rate improvements with increases in parallel scale, library writers have responded by using tiled algorithms, where operand size is constrained in order to maximize parallelism, as seen in the well-known PLASMA library. This approach has two main drawbacks: (1) asymptotic performance is reduced due to limited operand size, (2) performance of small to medium sized problems is reduced due to unnecessary data motion in the parallel caches. In this paper we introduce a new approach where asymptotic performance is maximized by using special low-overhead kernel primitives that are auto-generated by the ATLAS framework, while unnecessary cache motion is minimized by using explicit cache management. We show that this technique can outperform all known libraries at all problem sizes on commodity parallel Intel and AMD platforms, with asymptotic LU performance of roughly 91% of hardware theoretical peak for a 12-core Intel Xeon, and 87% for a 32-core AMD Opteron.","PeriodicalId":309291,"journal":{"name":"2014 IEEE 28th International Parallel and Distributed Processing Symposium","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129849743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 6

Multi-resource Real-Time Reader/Writer Locks for Multiprocessors 多处理器的多资源实时读/写锁

2014 IEEE 28th International Parallel and Distributed Processing Symposium Pub Date : 2014-05-19 DOI: 10.1109/IPDPS.2014.29

Bryan C. Ward, James H. Anderson

引用次数: 21