{"title":"The super warp architecture with random address shift","authors":"K. Nakano, Susumu Matsumae","doi":"10.1109/HiPC.2013.6799118","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799118","url":null,"abstract":"The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of memory access by a streaming multiprocessor on CUDA-enabled GPUs. The DMM has w memory banks that constitute a shared memory, and each warp of w threads access the shared memory at the same time. However, memory access requests destined for the same memory bank are processed sequentially. Hence, it is very important for developing efficient algorithms to reduce the memory access congestion, the maximum number of memory access requests destined for the same bank. However, it is not easy to minimize the memory access congestion for some problems. The main contribution of this paper is to present novel and practical parallel computing models in which the congestion is small for any memory access requests. We first present the Super Discrete Memory Machine (SDMM), an extended version of the DMM, which supports a super warp with multiple warps. Memory access requests by multiple warps in a super warp are packed through pipeline registers to reduce the memory access congestion. We then go on to apply the random address shift technique to the SDMM. The resulting machine, the Random Super Discrete Memory Machine (RSDMM) can equalize memory access requests by a super warp. Quite surprisingly, for any memory access requests by a super warp on the RSDMM, the overhead of the memory access congestion is within a constant factor of perfectly scheduled memory access. Thus, unlike the DMM, developers of parallel algorithms do not have to consider the memory access congestion on the RSDMM. The congestion on the RSDMM is evaluated by theoretical analysis as well as by experiments.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"283 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129515524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance evaluation of medical imaging algorithms on Intel® MIC platform","authors":"Jyotsna Khemka, Mrugesh R. Gajjar, Sharan Vaswani, N. Vydyanathan, Ramakrishna M. V. Malladi, V. VinuthaS.","doi":"10.1109/HiPC.2013.6799121","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799121","url":null,"abstract":"Heterogeneous computer architectures, where CPUs co-exist with accelerators such as vector coprocessors, GPUs and FPGAs, are rapidly evolving to be powerful platforms for tomorrow's exa-scale computing. The Intel® Many Integrated Core (MIC) architecture is Intel's first step towards heterogeneous computing. This paper investigates the performance of the MIC platform in the context of medical imaging and signal processing. Specifically, we analyze the achieved performance of two popular algorithms: Complex Finite Impulse Response (FIR) filtering which is used in ultrasound signal processing and Simultaneous Algebraic Reconstruction Technique (SART) which is used in 3D Computed tomography (CT) volume reconstruction. These algorithms are evaluated on Intel® Xeon Phi™ using Intel's heterogeneous offload model. Our analysis indicates that execution times of both of these algorithms are dominated by the memory access times and hence effective cache utilization as well as vectorization play a significant role in determining the achieved performance. Overall, we perceive that Intel® MIC is an easy-to-program accelerator of the future that shows good potential in terms of performance.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131737612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GAGM: Genome assembly on GPU using mate pairs","authors":"Ashutosh Jain, Anshuj Garg, K. Paul","doi":"10.1109/HiPC.2013.6799107","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799107","url":null,"abstract":"Genome fragment assembly has long been a time and computation intensive problem in the field of bioinformatics. Many parallel assemblers have been proposed to accelerate the process but there hasn't been any effective approach proposed for GPUs. Also with the increasing power of GPUs, applications from various research fields are being parallelized to take advantage of the massive number of “cores” available in GPUs. In this paper we present the design and development of a GPU based assembler (GAGM) for sequence assembly using Nvidia's GPUs with the CUDA programming model. Our assembler utilizes the mate pair reads produced by the current NGS technologies to build paired de Bruijn graph. Every paired read is broken into paired k-mers and l-mers. Every paired k-mer represents a vertex and paired l-mers are mapped as edges. Contigs are formed by grouping the regions of graph which can be unambiguously connected. We present parallel algorithms for k - mer extraction, paired de Bruijn graph construction and grouping of edges. We have benchmarked GAGM on four bacterial genomes. Our results show that the design on GPU is effective in terms of time as well as the quality of assembly produced.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123475684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Branch-and-Bound algorithm using multiple GPU-based LP solvers","authors":"Xavier Meyer, B. Chopard, P. Albuquerque","doi":"10.1109/HiPC.2013.6799105","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799105","url":null,"abstract":"The Branch-and-Bound (B&B) method is a well-known optimization algorithm for solving integer linear programming (ILP) models in the field of operations research. It is part of software often employed by businesses for finding solutions to problems such as airline scheduling problems. It operates according to a divide-and-conquer principle by building a tree-like structure with nodes that represent linear programming (LP) problems. A LP solver commonly used to process the nodes is the simplex method. Nowadays its sequential implementation can be found in almost all commercial ILP solvers. In this paper, we present a hybrid CPU-GPU implementation of the B&B algorithm. The B&B tree is managed by the CPU, while the revised simplex method is mainly a GPU implementation, relying on the CUDA technology of NVIDIA. The CPU manages concurrently multiple instances of the LP solver. The principal difference with a sequential implementation of the B&B algorithm pertains to the LP solver, provided that the B&B tree is managed with the same strategy. We thus compared our GPU-based implementation of the revised simplex to a well-known open-source sequential solver, named CLP, of the COIN-OR project. For given problem densities, we measured a size threshhold beyond which our GPU implementation outperformed its sequential counterpart.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124250714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algorithms for the relaxed Multiple-Organization Multiple-Machine Scheduling Problem","authors":"Anirudh Chakravorty, Neelima Gupta, Neha Lawaria, Pankaj Kumar, Yogish Sabharwal","doi":"10.1109/HiPC.2013.6799127","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799127","url":null,"abstract":"In this paper we present the generalization of the relaxed Multi- Organization Scheduling Problem (α MOSP). In our generalized problem, we are given a set of organizations; each organization is comprised of a set of machines. We are interested in minimizing the global makespan while allowing a constant factor, αO, degradation in the local objective of each organization and a constant factor, αM, degradation in the local objective of each machine. Previous work on α MOSP have primarily focussed on the degree of co-operativeness only at organization level whereas the degree of co-operativeness of an individual machine is also equally important. We develop a general framework for building approximation algorithms for the problem. Using this framework we present a family of approximation algorithms with varying approximation guarantees on the global makespan and the degrees of cooperativeness of the machines and organizations. In particular, we present (4, 2, 3), (4, 3, 2) and (3, 3, 3) approximation results where the first, and second values in the triplet represent the degree of co-operativeness of the machines and the organizations respectively and the third value denotes approximation guarantee for the global makespan. We also present and experimentally analyze different heuristics to improve the global makespan once solutions with the above theoretical guarantees are obtained.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124909337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new parallel algorithm for connected components in dynamic graphs","authors":"R. McColl, Oded Green, David A. Bader","doi":"10.1109/HiPC.2013.6799108","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799108","url":null,"abstract":"Social networks, communication networks, business intelligence databases, and large scientific data sources now contain hundreds of millions elements with billions of relationships. The relationships in these massive datasets are changing at ever-faster rates. Through representing these datasets as dynamic and semantic graphs of vertices and edges, it is possible to characterize the structure of the relationships and to quickly respond to queries about how the elements in the set are connected. Statically computing analytics on snapshots of these dynamic graphs is frequently not fast enough to provide current and accurate information as the graph changes. This has led to the development of dynamic graph algorithms that can maintain analytic information without resorting to full static recomputation. In this work we present a novel parallel algorithm for tracking the connected components of a dynamic graph. Our approach has a low memory requirement of O(V) and is appropriate for all graph densities. On a graph with 512 million edges, we show that our new dynamic algorithm is up to 128X faster than well-known static algorithms and that our algorithm achieves a 14X parallel speedup on a x86 64-core shared-memory system. To the best of the authors' knowledge, this is the first parallel implementation of dynamic connected components that does not eventually require static recomputation.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127918694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minimization of cloud task execution length with workload prediction errors","authors":"S. Di, Cho-Li Wang","doi":"10.1109/HiPC.2013.6799101","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799101","url":null,"abstract":"In cloud systems, it is non-trivial to optimize task's execution performance under user's affordable budget, especially with possible workload prediction errors. Based on an optimal algorithm that can minimize cloud task's execution length with predicted workload and budget, we theoretically derive the upper bound of the task execution length by taking into account the possible workload prediction errors. With such a state-of-the-art bound, the worst-case performance of a task execution with a certain workload prediction errors is predictable. On the other hand, we build a close-to-practice cloud prototype over a real cluster environment deployed with 56 virtual machines, and evaluate our solution with different resource contention degrees. Experiments show that task execution lengths under our solution with estimates of worst-case performance are close to their theoretical ideal values, in both non-competitive situation with adequate resources and the competitive situation with a certain limited available resources. We also observe a fair treatment on the resource allocation among all tasks.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120953705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effects of phase imbalance on data center energy management","authors":"Sushil Gupta, Ayan Banerjee, Z. Abbasi, S. Gupta","doi":"10.1109/HiPC.2013.6799099","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799099","url":null,"abstract":"Phase imbalance has been considered as a source of inefficiency in the data center that causes energy loss due to line impedance and increases reactive power. Strategies assume high loss due to phase imbalance and propose sophisticated energy management algorithms including phase balance aware workload scheduling algorithm and dynamic power distribution unit assignment to servers. However, such attempts do not utilize an objective measure of the inefficiencies due to phase imbalance to evaluate the significance of their contributions. Excessive imbalance in a three phase load has various undesirable effects. This paper, first objectively characterizes the inefficiencies due to phase imbalance and then provides numerical measures of the losses in realistic data center deployments. Phase imbalanced load in a delta configuration results in reduced power factor, which is undesirable for several reasons. Also, an imbalanced load (both in delta or star configuration), results in higher line currents, leading to higher line loss. However, this increase in loss is a fraction of a percentage of the energy consumed. The paper also discusses effects of work load scheduling on phase imbalance, and how to minimize the same.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"149 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131852868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scheduling associative reductions with homogeneous costs when overlapping communications and computations","authors":"Louis-Claude Canon","doi":"10.1109/HiPC.2013.6799124","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799124","url":null,"abstract":"Reduction is a core operation in parallel computing that combines distributed elements into a single result. Optimizing its cost may greatly reduce the application execution time, notably in MPI and MapReduce computations. In this paper, we propose an algorithm for scheduling associative reductions. We focus on the case where communications and computations can be overlapped to fully exploit resources. Our algorithm greedily builds a spanning tree by starting from the root and by adding a child at each iteration. Bounds on the completion time of optimal schedules are then characterized. To show the algorithm extensibility, we adapt it to model variations in which either communication or computation resources are limited. Moreover, we study two specific spanning trees: while the binomial tree is optimal when there is either no transfer or no computation, the k-ary Fibonacci tree is optimal when the transfer cost is equal to the computation cost. Finally, approximation ratios of strategies based on those trees are derived.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134536406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A self-tuning system based on application Profiling and Performance Analysis for optimizing Hadoop MapReduce cluster configuration","authors":"Dili Wu, A. Gokhale","doi":"10.1109/HiPC.2013.6799133","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799133","url":null,"abstract":"One of the most widely used frameworks for programming MapReduce-based applications is Apache Hadoop. Despite its popularity, however, application developers face numerous challenges in using the Hadoop framework, which stem from them having to effectively manage the resources of a MapReduce cluster, and configuring the framework in a way that will optimize the performance and reliability of MapReduce applications running on it. This paper addresses these problems by presenting the Profiling and Performance Analysis-based System (PPABS) framework, which automates the tuning of Hadoop configuration settings based on deduced application performance requirements. The PPABS framework comprises two distinct phases called the Analyzer, which trains PPABS to form a set of equivalence classes of MapReduce applications for which the most appropriate Hadoop config- uration parameters that maximally improve performance for that class are determined, and the Recognizer, which classifies an incoming unknown job to one of these equivalence classes so that its Hadoop configuration parameters can be self-tuned. The key research contributions in the Analyzer phase includes modifications to the well-known k - means + + clustering and Simulated Annealing algorithms, which were required to adapt them to the MapReduce paradigm. The key contributions in the Recognizer phase includes an approach to classify an unknown, incoming job to one of the equivalence classes and a control strategy to self-tune the Hadoop cluster configuration parameters for that job. Experimental results comparing the performance improvements for three different classes of applications running on Hadoop clusters deployed on Amazon EC2 show promising results.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128674351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}