Performance Modeling for RDMA-Enhanced Hadoop MapReduce
Md. Wasi-ur-Rahman, Xiaoyi Lu, Nusrat S. Islam, D. Panda
2014 43rd International Conference on Parallel Processing (ICPP). doi:10.1109/ICPP.2014.14
Abstract: Hadoop MapReduce is a popular parallel programming paradigm that allows scalable and fault-tolerant solutions to data-intensive applications on modern clusters. However, the framework is unable to take full advantage of high-performance interconnects. Recent studies show that by leveraging the benefits of high-performance interconnects, the overall performance of MapReduce jobs can be greatly enhanced through additional features such as in-memory merge, pipelined merge and reduce, and pre-fetching and caching of map outputs. Existing performance models are not sufficient to predict the performance behavior of RDMA-enhanced MapReduce with these features. In this paper, we propose a detailed mathematical model of RDMA-enhanced MapReduce based on a number of cluster-wide and job-level configuration parameters. We also propose a simplified version of this model for predicting large-scale MapReduce job executions and validate it across various system and workload configurations. Results derived from the proposed model match the experimental results within a 2-11% range. To the best of our knowledge, this is the first model that correctly predicts the behavior of RDMA-enhanced Hadoop MapReduce.
Parallel Simulation of Superscalar Scheduling
Blake Haugen, J. Kurzak, A. YarKhan, P. Luszczek, J. Dongarra
2014 43rd International Conference on Parallel Processing (ICPP). doi:10.1109/ICPP.2014.21
Abstract: Computers have been moving toward a multicore paradigm for several years, and as a result software developers must design applications that exploit the inherent parallelism of modern computing architectures. One area of research aimed at simplifying this shift is the development of dynamic scheduling utilities that allow the developer to write serial code that is parallelized by a library or compiler technology. While these tools certainly increase the developer's productivity, they can obfuscate performance bottlenecks. It is therefore important to evaluate algorithm performance to ensure that the potential performance of a given algorithm is actually realized under a dynamic scheduling utility. This paper presents the methodology and results of a new performance analysis tool that aims to accurately simulate the performance of various superscalar schedulers, including OmpSs, StarPU, and QUARK. The process begins with careful timing of each of the computational routines that make up the algorithm. The simulation tool then uses the timing of the computational kernels in conjunction with the dependency management provided by the superscalar scheduler to simulate the execution time of the algorithm. This tool demonstrates that simulation can accurately predict the performance of a complex dynamic scheduling system across various algorithms.
A Hybrid CPU-GPU System for Stitching Large Scale Optical Microscopy Images
Timothy Blattner, Walid Keyrouz, J. Chalfoun, Bertrand Stivalet, M. Brady, Shujia Zhou
2014 43rd International Conference on Parallel Processing (ICPP). doi:10.1109/ICPP.2014.9
Abstract: Researchers in various fields are using optical microscopy to acquire very large images, 10,000-200,000 pixels per side. Optical microscopes acquire these images as grids of overlapping partial images (thousands of pixels per side) that are then stitched together via software. Composing such large images is a compute- and data-intensive task even for modern machines. Researchers compound this difficulty further by obtaining time-series, volumetric, or multiple-channel images, with the resulting data sets now reaching or approaching terabyte sizes. We present a scalable hybrid CPU-GPU implementation of image stitching that processes large image sets at near-interactive rates. Our implementation scales well with both image sizes and the number of CPU cores and GPU cards in a machine. It processes a grid of 42 × 59 tiles into a 17k × 22k-pixel image in 43 s (end-to-end execution time) when using one NVIDIA Tesla C2070 card and two Intel Xeon E5620 quad-core CPUs, and in 29 s when using two Tesla C2070 cards and the same two CPUs. It can also compose and render the composite image, without saving it, in 15 s. In comparison, ImageJ/Fiji, which is widely used by biologists, has an image stitching plugin that takes > 3.6 h for the same workload, despite being multithreaded and executing the same mathematical operators, and composes and saves the large image in an additional 1.5 h. Our implementation takes advantage of coarse-grain parallelism. It organizes the computation into a pipeline architecture that spans CPU and GPU resources and overlaps computation with data motion. The implementation achieves a nearly 10× performance improvement over our optimized non-pipelined GPU implementation and demonstrates near-linear speedup when increasing the CPU thread count and the number of GPUs.
{"title":"A Fast Batched Cholesky Factorization on a GPU","authors":"Tingxing Dong, A. Haidar, S. Tomov, J. Dongarra","doi":"10.1109/ICPP.2014.52","DOIUrl":"https://doi.org/10.1109/ICPP.2014.52","url":null,"abstract":"Currently, state of the art libraries, like MAGMA, focus on very large linear algebra problems, while solving many small independent problems, which is usually referred to as batched problems, is not given adequate attention. In this paper, we proposed a batched Cholesky factorization on a GPU. Three algorithms -- non-blocked, blocked, and recursive blocked -- were examined. The left-looking version of the Cholesky factorization is used to factorize the panel, and the right-looking Cholesky version is used to update the trailing matrix in the recursive blocked algorithm. Our batched Cholesky achieves up to 1.8× speedup compared to the optimized parallel implementation in the MKL library on two sockets of Intel Sandy Bridge CPUs. Further, we use the new routines to develop a single Cholesky factorization solver which targets large matrix sizes. Our approach differs from MAGMA by having an entirely GPU implementation where both the panel factorization and the trailing matrix updates are on the GPU. Such an implementation does not depend on the speed of the CPU. Compared to the MAGMA library, our full GPU solution achieves 85% of the hybrid MAGMA performance which uses 16 Sandy Bridge cores, in addition to a K40 Nvidia GPU. Moreover, we achieve 80% of the practical dgemm peak of the machine, while MAGMA achieves only 75%, and finally, in terms of energy consumption, we outperform MAGMAby 1.5× in performance-per-watt for large matrices.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121956874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Grover: Looking for Performance Improvement by Disabling Local Memory Usage in OpenCL Kernels
Jianbin Fang, H. Sips, P. Jääskeläinen, A. Varbanescu
2014 43rd International Conference on Parallel Processing (ICPP). doi:10.1109/ICPP.2014.25
Abstract: Due to the diversity of processor architectures and application memory access patterns, the performance impact of using local memory in OpenCL kernels has become unpredictable. For example, enabling the use of local memory for an OpenCL kernel can be beneficial for execution on a GPU, but can lead to performance losses when running on a CPU. To address this unpredictability, we propose an empirical approach: by disabling the use of local memory in OpenCL kernels, we enable users to compare the kernel versions with and without local memory, and then choose the best performing version for a given platform. To this end, we have designed Grover, a method to automatically remove local memory usage from OpenCL kernels. In particular, we create a correspondence between the global and local memory spaces, which is used to replace local memory accesses by global memory accesses. We have implemented this scheme in the LLVM framework as a compiler pass, which automatically transforms an OpenCL kernel with local memory into a version without it. We have validated Grover with 11 applications and found that it can successfully disable local memory usage for all of them. We have compared the kernels with and without local memory on three different processors, and found performance improvements for more than a third of the test cases after Grover disabled local memory usage. We conclude that such a compiler pass can be beneficial for performance and, because it is fully automated, can be used as an auto-tuning step for OpenCL kernels.
{"title":"Double Free: Measurement-Free Localization for Transceiver-Free Object","authors":"Dian Zhang, Xiaoyan Jiang, L. Ni","doi":"10.1109/ICPP.2014.62","DOIUrl":"https://doi.org/10.1109/ICPP.2014.62","url":null,"abstract":"Transceiver-free object localization is essential for emerging location-based service, e.g., the safe guard system and asset security. It can track indoor target without carrying any device and has attracted many research effort. Among these technologies, Radio Signal Strength (RSS) based approaches are very popular because of their low-cost and wide applicability. In such work, usually a large number of reference nodes have to be deployed. However, if in a very large area, many labor work to measure the positions of the reference nodes have to be performed, result in not practical in real scenario. In this paper, we propose Double Free, which can accurately track transceiver-free object without measuring the positions of the reference nodes. Users may randomly deploy nodes in a 2D area, e.g., the ceiling of the floor. Our Double Free contains two steps: reference node localization and target localization. The key to achieve the first step is to utilize the RSS difference in different channel to distinguish the Line-Of-Sight (LOS) signal from combined multiple paths' signal. Thus, the reference nodes can be accurately localized without additional hardware. In the second step, we propose two algorithms: Influential Link & Node (ILN) and MultiPath Distinguishing (MD). ILN is simple to implement, while MD can accurately model the additional signal caused by the target, then accurately localize the target. To implement this idea, 16 TelosB nodes are placed randomly in a 25×10m2 laboratory. The experiment results show, the average localization error is only round 2 meters without requiring to measure the positions of reference nodes in advance. It shows enormous potential in those localization areas, where manual measurement is hard to perform, or hard labor work want to be saved.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130910823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations","authors":"Akihiko Kasagi, K. Nakano, Yasuaki Ito","doi":"10.1109/ICPP.2014.34","DOIUrl":"https://doi.org/10.1109/ICPP.2014.34","url":null,"abstract":"The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. The summed area table (SAT) of a matrix is a data structure frequently used in the area of computer vision which can be obtained by computing the column-wise prefix-sums and then the row-wise prefix-sums. The main contribution of this paper is to introduce the asynchronous Hierarchical Memory Machine (asynchronous HMM), which supports asynchronous execution of CUDA blocks, and show a global-memory-access-optimal parallel algorithm for computing the SAT on the asynchronous HMM. A straightforward algorithm (2R2W SAT algorithm) on the asynchronous HMM, which computes the prefix-sums in every column using one thread each and then computes the prefix-sums in every row, performs 2 read operations and 2 write operations per element of a matrix. The previously published best algorithm (2R1W SAT algorithm) performs 2 read operations and 1 write operation per element. We present a more efficient algorithm (1R1W SAT algorithm) which performs 1 read operation and 1 write operation per element. Clearly, since every element in a matrix must be read at least once, and all resulting values must be written, our 1R1W SAT algorithm is optimal in terms of the global memory access. We also show a combined algorithm ((1 + r)R1W SAT algorithm) of 2R1W and 1R1W SAT algorithms that may have better performance. We have implemented several algorithms including 2R2W, 2R1W, 1R1W, (1 + r)R1W SAT algorithms on GeForce GTX 780 Ti. The experimental results show that our (1 + r)R1W SAT algorithm runs faster than any other SAT algorithms for large input matrices. Also, it runs more than 100 times faster than the best SAT algorithm using a single CPU.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114606422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing the Impact of Rollback Avoidance at Extreme-Scale: A Modeling Approach","authors":"Scott Levy, Kurt B. Ferreira, P. Bridges","doi":"10.1109/ICPP.2014.49","DOIUrl":"https://doi.org/10.1109/ICPP.2014.49","url":null,"abstract":"Resilience to failure is a key concern for next-generation high-performance computing systems. The dominant fault tolerance mechanism, coordinated checkpoint/restart, is projected to no longer be a viable option on these systems due to its predicted overheads. Rollback avoidance has the potential to prolong the viability of coordinated checkpoint/restart by allowing an application to make meaningful forward progress, perhaps with degraded performance, despite the occurrence or imminence of a failure. In this paper, we present two general analytic models for the performance of rollback avoidance techniques and validate these models against the performance of existing rollback avoidance techniques. We then use these models to evaluate the applicability of rollback avoidance for next-generation exascale systems. This includes analysis of exascale system design questions such as: (1) how effective must an application-specific rollback avoidance technique be to usefully augment checkpointing in an exascale system? (2) when is rollback avoidance on its own a viable alternative to coordinated checkpointing? and (3) how do rollback avoidance techniques and system characteristics interact to influence application performance?","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125972155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing a Heuristic Cross-Architecture Combination for Breadth-First Search","authors":"Yang You, David A. Bader, M. Dehnavi","doi":"10.1109/ICPP.2014.16","DOIUrl":"https://doi.org/10.1109/ICPP.2014.16","url":null,"abstract":"Breadth-First Search (BFS) is widely used in real-world applications including computational biology, social networks, and electronic design automation. The most effective BFS approach has been shown to be a combination of top-down and bottom-up approaches. Such hybrid techniques need to identify a switching point which is conventionally found through expensive trial-and-error and exhaustive search routines. We present an adaptive method based on regression analysis that enables dynamic switching at runtime with little overhead. We improve the performance of our method by exploiting popular heterogeneous platforms and efficiently design the approach for a given architecture. An 155x speedup is achieved over the standard top-down approach on GPUs. Our approach is the first to combine top-down and bottom-up across different architectures. Unlike combination on a single architecture, a mistuned switching point may significantly decrease the performance of cross-architecture combination. Our adaptive method can predict the switching point with high accuracy, leading to an 695x speedup compared the worst switching point.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127549499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Constraint Programming-Based Resource Management Technique for Processing MapReduce Jobs with SLAs on Clouds","authors":"Norman Lim, S. Majumdar, P. Ashwood-Smith","doi":"10.1109/ICPP.2014.50","DOIUrl":"https://doi.org/10.1109/ICPP.2014.50","url":null,"abstract":"Clouds that are rapidly gaining in popularity require an effective resource manager that can harness the power of the underlying resource pool, and provide resources on demand to its users. This paper focuses on resource management on clouds for workflow requests characterized by Service Level Agreements (SLAs). Specifically, we devise a novel MapReduce constraint programming based resource manager (MRCP-RM) that can effectively perform matchmaking and scheduling of MapReduce jobs, each characterized by an SLA comprising an earliest start time, execution time, and an end-to-end deadline. Using discrete event simulation a performance evaluation of MRCP-RM is conducted for an open system subjected to a stream of job arrivals. The simulation results demonstrate the effectiveness of the resource manager and provide insights into system behaviour and performance.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132507968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}