Performance Modeling for RDMA-Enhanced Hadoop MapReduce
Md. Wasi-ur-Rahman, Xiaoyi Lu, Nusrat S. Islam, D. Panda
2014 43rd International Conference on Parallel Processing (ICPP). doi:10.1109/ICPP.2014.14
Abstract: Hadoop MapReduce is a popular parallel programming paradigm that allows scalable and fault-tolerant solutions to data-intensive applications on modern clusters. However, the framework is unable to take full advantage of high-performance interconnects. Recent studies show that by leveraging the benefits of high-performance interconnects, the overall performance of MapReduce jobs can be greatly enhanced through additional features such as in-memory merge, pipelined merge and reduce, and pre-fetching and caching of map outputs. Existing performance models are not sufficient to predict the performance behavior of RDMA-enhanced MapReduce with these features. In this paper, we propose a detailed mathematical model of RDMA-enhanced MapReduce based on a number of cluster-wide and job-level configuration parameters. We also propose a simplified version of this model for predicting large-scale MapReduce job executions and validate it across various system and workload configurations. Results derived from the proposed model match the experimental results within a 2-11% range. To the best of our knowledge, this is the first model that correctly predicts the behavior of RDMA-enhanced Hadoop MapReduce.
Parallel Simulation of Superscalar Scheduling
Blake Haugen, J. Kurzak, A. YarKhan, P. Luszczek, J. Dongarra
2014 43rd International Conference on Parallel Processing (ICPP). doi:10.1109/ICPP.2014.21
Abstract: Computers have been moving toward a multicore paradigm for several years, and as a result software developers must design applications that exploit the inherent parallelism of modern computing architectures. One area of research aimed at simplifying this shift is the development of dynamic scheduling utilities that allow the developer to write serial code that is parallelized by a library or compiler technology. While these tools certainly increase the developer's productivity, they can obfuscate performance bottlenecks. It is therefore important to evaluate algorithm performance to ensure that the potential performance of a given algorithm is actually realized under a dynamic scheduling utility. This paper presents the methodology and results of a new performance analysis tool that aims to accurately simulate the performance of various superscalar schedulers, including OmpSs, StarPU, and QUARK. The process begins with careful timing of each of the computational routines that make up the algorithm. The simulation tool then uses the timing of the computational kernels in conjunction with the dependency management provided by the superscalar scheduler to simulate the execution time of the algorithm. This tool demonstrates that simulation can accurately predict the performance of a complex dynamic scheduling system across various algorithms.
A Hybrid CPU-GPU System for Stitching Large Scale Optical Microscopy Images
Timothy Blattner, Walid Keyrouz, J. Chalfoun, Bertrand Stivalet, M. Brady, Shujia Zhou
2014 43rd International Conference on Parallel Processing (ICPP). doi:10.1109/ICPP.2014.9
Abstract: Researchers in various fields are using optical microscopy to acquire very large images, 10,000-200,000 pixels per side. Optical microscopes acquire these images as grids of overlapping partial images (thousands of pixels per side) that are then stitched together via software. Composing such large images is a compute- and data-intensive task even for modern machines. Researchers compound this difficulty further by obtaining time-series, volumetric, or multiple-channel images, with the resulting data sets now reaching or approaching terabyte sizes. We present a scalable hybrid CPU-GPU implementation of image stitching that processes large image sets at near-interactive rates. Our implementation scales well with both image sizes and the number of CPU cores and GPU cards in a machine. It processes a grid of 42 × 59 tiles into a 17k × 22k-pixel image in 43 s (end-to-end execution time) when using one NVIDIA Tesla C2070 card and two Intel Xeon E5620 quad-core CPUs, and in 29 s when using two Tesla C2070 cards and the same two CPUs. It can also compose and render the composite image, without saving it, in 15 s. In comparison, ImageJ/Fiji, which is widely used by biologists, has an image stitching plugin that takes > 3.6 h for the same workload, despite being multithreaded and executing the same mathematical operators, and composes and saves the large image in an additional 1.5 h. Our implementation takes advantage of coarse-grain parallelism. It organizes the computation into a pipeline architecture that spans CPU and GPU resources and overlaps computation with data motion. The implementation achieves a nearly 10× performance improvement over our optimized non-pipelined GPU implementation and demonstrates near-linear speedup when increasing the CPU thread count and the number of GPUs.
{"title":"A Fast Batched Cholesky Factorization on a GPU","authors":"Tingxing Dong, A. Haidar, S. Tomov, J. Dongarra","doi":"10.1109/ICPP.2014.52","DOIUrl":"https://doi.org/10.1109/ICPP.2014.52","url":null,"abstract":"Currently, state of the art libraries, like MAGMA, focus on very large linear algebra problems, while solving many small independent problems, which is usually referred to as batched problems, is not given adequate attention. In this paper, we proposed a batched Cholesky factorization on a GPU. Three algorithms -- non-blocked, blocked, and recursive blocked -- were examined. The left-looking version of the Cholesky factorization is used to factorize the panel, and the right-looking Cholesky version is used to update the trailing matrix in the recursive blocked algorithm. Our batched Cholesky achieves up to 1.8× speedup compared to the optimized parallel implementation in the MKL library on two sockets of Intel Sandy Bridge CPUs. Further, we use the new routines to develop a single Cholesky factorization solver which targets large matrix sizes. Our approach differs from MAGMA by having an entirely GPU implementation where both the panel factorization and the trailing matrix updates are on the GPU. Such an implementation does not depend on the speed of the CPU. Compared to the MAGMA library, our full GPU solution achieves 85% of the hybrid MAGMA performance which uses 16 Sandy Bridge cores, in addition to a K40 Nvidia GPU. Moreover, we achieve 80% of the practical dgemm peak of the machine, while MAGMA achieves only 75%, and finally, in terms of energy consumption, we outperform MAGMAby 1.5× in performance-per-watt for large matrices.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121956874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Grover: Looking for Performance Improvement by Disabling Local Memory Usage in OpenCL Kernels
Jianbin Fang, H. Sips, P. Jääskeläinen, A. Varbanescu
2014 43rd International Conference on Parallel Processing (ICPP). doi:10.1109/ICPP.2014.25
Abstract: Due to the diversity of processor architectures and application memory access patterns, the performance impact of using local memory in OpenCL kernels has become unpredictable. For example, enabling the use of local memory for an OpenCL kernel can be beneficial for execution on a GPU, but can lead to performance losses when running on a CPU. To address this unpredictability, we propose an empirical approach: by disabling the use of local memory in OpenCL kernels, we enable users to compare the kernel versions with and without local memory, and then choose the best performing version for a given platform. To this end, we have designed Grover, a method to automatically remove local memory usage from OpenCL kernels. In particular, we create a correspondence between the global and local memory spaces, which is used to replace local memory accesses by global memory accesses. We have implemented this scheme in the LLVM framework as a compiler pass, which automatically transforms an OpenCL kernel with local memory into a version without it. We have validated Grover with 11 applications and found that it can successfully disable local memory usage for all of them. We have compared the kernels with and without local memory on three different processors, and found performance improvements for more than a third of the test cases after Grover disabled local memory usage. We conclude that such a compiler pass can be beneficial for performance and, because it is fully automated, can be used as an auto-tuning step for OpenCL kernels.
{"title":"Double Free: Measurement-Free Localization for Transceiver-Free Object","authors":"Dian Zhang, Xiaoyan Jiang, L. Ni","doi":"10.1109/ICPP.2014.62","DOIUrl":"https://doi.org/10.1109/ICPP.2014.62","url":null,"abstract":"Transceiver-free object localization is essential for emerging location-based service, e.g., the safe guard system and asset security. It can track indoor target without carrying any device and has attracted many research effort. Among these technologies, Radio Signal Strength (RSS) based approaches are very popular because of their low-cost and wide applicability. In such work, usually a large number of reference nodes have to be deployed. However, if in a very large area, many labor work to measure the positions of the reference nodes have to be performed, result in not practical in real scenario. In this paper, we propose Double Free, which can accurately track transceiver-free object without measuring the positions of the reference nodes. Users may randomly deploy nodes in a 2D area, e.g., the ceiling of the floor. Our Double Free contains two steps: reference node localization and target localization. The key to achieve the first step is to utilize the RSS difference in different channel to distinguish the Line-Of-Sight (LOS) signal from combined multiple paths' signal. Thus, the reference nodes can be accurately localized without additional hardware. In the second step, we propose two algorithms: Influential Link & Node (ILN) and MultiPath Distinguishing (MD). ILN is simple to implement, while MD can accurately model the additional signal caused by the target, then accurately localize the target. To implement this idea, 16 TelosB nodes are placed randomly in a 25×10m2 laboratory. The experiment results show, the average localization error is only round 2 meters without requiring to measure the positions of reference nodes in advance. It shows enormous potential in those localization areas, where manual measurement is hard to perform, or hard labor work want to be saved.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130910823","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations","authors":"Akihiko Kasagi, K. Nakano, Yasuaki Ito","doi":"10.1109/ICPP.2014.34","DOIUrl":"https://doi.org/10.1109/ICPP.2014.34","url":null,"abstract":"The Hierarchical Memory Machine (HMM) is a theoretical parallel computing model that captures the essence of computing on CUDA-enabled GPUs. The summed area table (SAT) of a matrix is a data structure frequently used in the area of computer vision which can be obtained by computing the column-wise prefix-sums and then the row-wise prefix-sums. The main contribution of this paper is to introduce the asynchronous Hierarchical Memory Machine (asynchronous HMM), which supports asynchronous execution of CUDA blocks, and show a global-memory-access-optimal parallel algorithm for computing the SAT on the asynchronous HMM. A straightforward algorithm (2R2W SAT algorithm) on the asynchronous HMM, which computes the prefix-sums in every column using one thread each and then computes the prefix-sums in every row, performs 2 read operations and 2 write operations per element of a matrix. The previously published best algorithm (2R1W SAT algorithm) performs 2 read operations and 1 write operation per element. We present a more efficient algorithm (1R1W SAT algorithm) which performs 1 read operation and 1 write operation per element. Clearly, since every element in a matrix must be read at least once, and all resulting values must be written, our 1R1W SAT algorithm is optimal in terms of the global memory access. We also show a combined algorithm ((1 + r)R1W SAT algorithm) of 2R1W and 1R1W SAT algorithms that may have better performance. We have implemented several algorithms including 2R2W, 2R1W, 1R1W, (1 + r)R1W SAT algorithms on GeForce GTX 780 Ti. The experimental results show that our (1 + r)R1W SAT algorithm runs faster than any other SAT algorithms for large input matrices. Also, it runs more than 100 times faster than the best SAT algorithm using a single CPU.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114606422","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Characterizing the Impact of Rollback Avoidance at Extreme-Scale: A Modeling Approach","authors":"Scott Levy, Kurt B. Ferreira, P. Bridges","doi":"10.1109/ICPP.2014.49","DOIUrl":"https://doi.org/10.1109/ICPP.2014.49","url":null,"abstract":"Resilience to failure is a key concern for next-generation high-performance computing systems. The dominant fault tolerance mechanism, coordinated checkpoint/restart, is projected to no longer be a viable option on these systems due to its predicted overheads. Rollback avoidance has the potential to prolong the viability of coordinated checkpoint/restart by allowing an application to make meaningful forward progress, perhaps with degraded performance, despite the occurrence or imminence of a failure. In this paper, we present two general analytic models for the performance of rollback avoidance techniques and validate these models against the performance of existing rollback avoidance techniques. We then use these models to evaluate the applicability of rollback avoidance for next-generation exascale systems. This includes analysis of exascale system design questions such as: (1) how effective must an application-specific rollback avoidance technique be to usefully augment checkpointing in an exascale system? (2) when is rollback avoidance on its own a viable alternative to coordinated checkpointing? and (3) how do rollback avoidance techniques and system characteristics interact to influence application performance?","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"78 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125972155","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing a Heuristic Cross-Architecture Combination for Breadth-First Search","authors":"Yang You, David A. Bader, M. Dehnavi","doi":"10.1109/ICPP.2014.16","DOIUrl":"https://doi.org/10.1109/ICPP.2014.16","url":null,"abstract":"Breadth-First Search (BFS) is widely used in real-world applications including computational biology, social networks, and electronic design automation. The most effective BFS approach has been shown to be a combination of top-down and bottom-up approaches. Such hybrid techniques need to identify a switching point which is conventionally found through expensive trial-and-error and exhaustive search routines. We present an adaptive method based on regression analysis that enables dynamic switching at runtime with little overhead. We improve the performance of our method by exploiting popular heterogeneous platforms and efficiently design the approach for a given architecture. An 155x speedup is achieved over the standard top-down approach on GPUs. Our approach is the first to combine top-down and bottom-up across different architectures. Unlike combination on a single architecture, a mistuned switching point may significantly decrease the performance of cross-architecture combination. Our adaptive method can predict the switching point with high accuracy, leading to an 695x speedup compared the worst switching point.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127549499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Constraint Programming-Based Resource Management Technique for Processing MapReduce Jobs with SLAs on Clouds","authors":"Norman Lim, S. Majumdar, P. Ashwood-Smith","doi":"10.1109/ICPP.2014.50","DOIUrl":"https://doi.org/10.1109/ICPP.2014.50","url":null,"abstract":"Clouds that are rapidly gaining in popularity require an effective resource manager that can harness the power of the underlying resource pool, and provide resources on demand to its users. This paper focuses on resource management on clouds for workflow requests characterized by Service Level Agreements (SLAs). Specifically, we devise a novel MapReduce constraint programming based resource manager (MRCP-RM) that can effectively perform matchmaking and scheduling of MapReduce jobs, each characterized by an SLA comprising an earliest start time, execution time, and an end-to-end deadline. Using discrete event simulation a performance evaluation of MRCP-RM is conducted for an open system subjected to a stream of job arrivals. The simulation results demonstrate the effectiveness of the resource manager and provide insights into system behaviour and performance.","PeriodicalId":441115,"journal":{"name":"2014 43rd International Conference on Parallel Processing","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132507968","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}