{"title":"The super warp architecture with random address shift","authors":"K. Nakano, Susumu Matsumae","doi":"10.1109/HiPC.2013.6799118","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799118","url":null,"abstract":"The Discrete Memory Machine (DMM) is a theoretical parallel computing model that captures the essence of memory access by a streaming multiprocessor on CUDA-enabled GPUs. The DMM has w memory banks that constitute a shared memory, and each warp of w threads access the shared memory at the same time. However, memory access requests destined for the same memory bank are processed sequentially. Hence, it is very important for developing efficient algorithms to reduce the memory access congestion, the maximum number of memory access requests destined for the same bank. However, it is not easy to minimize the memory access congestion for some problems. The main contribution of this paper is to present novel and practical parallel computing models in which the congestion is small for any memory access requests. We first present the Super Discrete Memory Machine (SDMM), an extended version of the DMM, which supports a super warp with multiple warps. Memory access requests by multiple warps in a super warp are packed through pipeline registers to reduce the memory access congestion. We then go on to apply the random address shift technique to the SDMM. The resulting machine, the Random Super Discrete Memory Machine (RSDMM) can equalize memory access requests by a super warp. Quite surprisingly, for any memory access requests by a super warp on the RSDMM, the overhead of the memory access congestion is within a constant factor of perfectly scheduled memory access. Thus, unlike the DMM, developers of parallel algorithms do not have to consider the memory access congestion on the RSDMM. The congestion on the RSDMM is evaluated by theoretical analysis as well as by experiments.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"283 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129515524","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance evaluation of medical imaging algorithms on Intel® MIC platform","authors":"Jyotsna Khemka, Mrugesh R. Gajjar, Sharan Vaswani, N. Vydyanathan, Ramakrishna M. V. Malladi, V. VinuthaS.","doi":"10.1109/HiPC.2013.6799121","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799121","url":null,"abstract":"Heterogeneous computer architectures, where CPUs co-exist with accelerators such as vector coprocessors, GPUs and FPGAs, are rapidly evolving to be powerful platforms for tomorrow's exa-scale computing. The Intel® Many Integrated Core (MIC) architecture is Intel's first step towards heterogeneous computing. This paper investigates the performance of the MIC platform in the context of medical imaging and signal processing. Specifically, we analyze the achieved performance of two popular algorithms: Complex Finite Impulse Response (FIR) filtering which is used in ultrasound signal processing and Simultaneous Algebraic Reconstruction Technique (SART) which is used in 3D Computed tomography (CT) volume reconstruction. These algorithms are evaluated on Intel® Xeon Phi™ using Intel's heterogeneous offload model. Our analysis indicates that execution times of both of these algorithms are dominated by the memory access times and hence effective cache utilization as well as vectorization play a significant role in determining the achieved performance. Overall, we perceive that Intel® MIC is an easy-to-program accelerator of the future that shows good potential in terms of performance.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131737612","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"GAGM: Genome assembly on GPU using mate pairs","authors":"Ashutosh Jain, Anshuj Garg, K. Paul","doi":"10.1109/HiPC.2013.6799107","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799107","url":null,"abstract":"Genome fragment assembly has long been a time and computation intensive problem in the field of bioinformatics. Many parallel assemblers have been proposed to accelerate the process but there hasn't been any effective approach proposed for GPUs. Also with the increasing power of GPUs, applications from various research fields are being parallelized to take advantage of the massive number of “cores” available in GPUs. In this paper we present the design and development of a GPU based assembler (GAGM) for sequence assembly using Nvidia's GPUs with the CUDA programming model. Our assembler utilizes the mate pair reads produced by the current NGS technologies to build paired de Bruijn graph. Every paired read is broken into paired k-mers and l-mers. Every paired k-mer represents a vertex and paired l-mers are mapped as edges. Contigs are formed by grouping the regions of graph which can be unambiguously connected. We present parallel algorithms for k - mer extraction, paired de Bruijn graph construction and grouping of edges. We have benchmarked GAGM on four bacterial genomes. Our results show that the design on GPU is effective in terms of time as well as the quality of assembly produced.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"137 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123475684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Branch-and-Bound algorithm using multiple GPU-based LP solvers","authors":"Xavier Meyer, B. Chopard, P. Albuquerque","doi":"10.1109/HiPC.2013.6799105","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799105","url":null,"abstract":"The Branch-and-Bound (B&B) method is a well-known optimization algorithm for solving integer linear programming (ILP) models in the field of operations research. It is part of software often employed by businesses for finding solutions to problems such as airline scheduling problems. It operates according to a divide-and-conquer principle by building a tree-like structure with nodes that represent linear programming (LP) problems. A LP solver commonly used to process the nodes is the simplex method. Nowadays its sequential implementation can be found in almost all commercial ILP solvers. In this paper, we present a hybrid CPU-GPU implementation of the B&B algorithm. The B&B tree is managed by the CPU, while the revised simplex method is mainly a GPU implementation, relying on the CUDA technology of NVIDIA. The CPU manages concurrently multiple instances of the LP solver. The principal difference with a sequential implementation of the B&B algorithm pertains to the LP solver, provided that the B&B tree is managed with the same strategy. We thus compared our GPU-based implementation of the revised simplex to a well-known open-source sequential solver, named CLP, of the COIN-OR project. For given problem densities, we measured a size threshhold beyond which our GPU implementation outperformed its sequential counterpart.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124250714","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Algorithms for the relaxed Multiple-Organization Multiple-Machine Scheduling Problem","authors":"Anirudh Chakravorty, Neelima Gupta, Neha Lawaria, Pankaj Kumar, Yogish Sabharwal","doi":"10.1109/HiPC.2013.6799127","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799127","url":null,"abstract":"In this paper we present the generalization of the relaxed Multi- Organization Scheduling Problem (α MOSP). In our generalized problem, we are given a set of organizations; each organization is comprised of a set of machines. We are interested in minimizing the global makespan while allowing a constant factor, αO, degradation in the local objective of each organization and a constant factor, αM, degradation in the local objective of each machine. Previous work on α MOSP have primarily focussed on the degree of co-operativeness only at organization level whereas the degree of co-operativeness of an individual machine is also equally important. We develop a general framework for building approximation algorithms for the problem. Using this framework we present a family of approximation algorithms with varying approximation guarantees on the global makespan and the degrees of cooperativeness of the machines and organizations. In particular, we present (4, 2, 3), (4, 3, 2) and (3, 3, 3) approximation results where the first, and second values in the triplet represent the degree of co-operativeness of the machines and the organizations respectively and the third value denotes approximation guarantee for the global makespan. We also present and experimentally analyze different heuristics to improve the global makespan once solutions with the above theoretical guarantees are obtained.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"57 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124909337","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A new parallel algorithm for connected components in dynamic graphs","authors":"R. McColl, Oded Green, David A. Bader","doi":"10.1109/HiPC.2013.6799108","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799108","url":null,"abstract":"Social networks, communication networks, business intelligence databases, and large scientific data sources now contain hundreds of millions elements with billions of relationships. The relationships in these massive datasets are changing at ever-faster rates. Through representing these datasets as dynamic and semantic graphs of vertices and edges, it is possible to characterize the structure of the relationships and to quickly respond to queries about how the elements in the set are connected. Statically computing analytics on snapshots of these dynamic graphs is frequently not fast enough to provide current and accurate information as the graph changes. This has led to the development of dynamic graph algorithms that can maintain analytic information without resorting to full static recomputation. In this work we present a novel parallel algorithm for tracking the connected components of a dynamic graph. Our approach has a low memory requirement of O(V) and is appropriate for all graph densities. On a graph with 512 million edges, we show that our new dynamic algorithm is up to 128X faster than well-known static algorithms and that our algorithm achieves a 14X parallel speedup on a x86 64-core shared-memory system. To the best of the authors' knowledge, this is the first parallel implementation of dynamic connected components that does not eventually require static recomputation.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127918694","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Minimization of cloud task execution length with workload prediction errors","authors":"S. Di, Cho-Li Wang","doi":"10.1109/HiPC.2013.6799101","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799101","url":null,"abstract":"In cloud systems, it is non-trivial to optimize task's execution performance under user's affordable budget, especially with possible workload prediction errors. Based on an optimal algorithm that can minimize cloud task's execution length with predicted workload and budget, we theoretically derive the upper bound of the task execution length by taking into account the possible workload prediction errors. With such a state-of-the-art bound, the worst-case performance of a task execution with a certain workload prediction errors is predictable. On the other hand, we build a close-to-practice cloud prototype over a real cluster environment deployed with 56 virtual machines, and evaluate our solution with different resource contention degrees. Experiments show that task execution lengths under our solution with estimates of worst-case performance are close to their theoretical ideal values, in both non-competitive situation with adequate resources and the competitive situation with a certain limited available resources. We also observe a fair treatment on the resource allocation among all tasks.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120953705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Effects of phase imbalance on data center energy management","authors":"Sushil Gupta, Ayan Banerjee, Z. Abbasi, S. Gupta","doi":"10.1109/HiPC.2013.6799099","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799099","url":null,"abstract":"Phase imbalance has been considered as a source of inefficiency in the data center that causes energy loss due to line impedance and increases reactive power. Strategies assume high loss due to phase imbalance and propose sophisticated energy management algorithms including phase balance aware workload scheduling algorithm and dynamic power distribution unit assignment to servers. However, such attempts do not utilize an objective measure of the inefficiencies due to phase imbalance to evaluate the significance of their contributions. Excessive imbalance in a three phase load has various undesirable effects. This paper, first objectively characterizes the inefficiencies due to phase imbalance and then provides numerical measures of the losses in realistic data center deployments. Phase imbalanced load in a delta configuration results in reduced power factor, which is undesirable for several reasons. Also, an imbalanced load (both in delta or star configuration), results in higher line currents, leading to higher line loss. However, this increase in loss is a fraction of a percentage of the energy consumed. The paper also discusses effects of work load scheduling on phase imbalance, and how to minimize the same.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"149 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131852868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scheduling associative reductions with homogeneous costs when overlapping communications and computations","authors":"Louis-Claude Canon","doi":"10.1109/HiPC.2013.6799124","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799124","url":null,"abstract":"Reduction is a core operation in parallel computing that combines distributed elements into a single result. Optimizing its cost may greatly reduce the application execution time, notably in MPI and MapReduce computations. In this paper, we propose an algorithm for scheduling associative reductions. We focus on the case where communications and computations can be overlapped to fully exploit resources. Our algorithm greedily builds a spanning tree by starting from the root and by adding a child at each iteration. Bounds on the completion time of optimal schedules are then characterized. To show the algorithm extensibility, we adapt it to model variations in which either communication or computation resources are limited. Moreover, we study two specific spanning trees: while the binomial tree is optimal when there is either no transfer or no computation, the k-ary Fibonacci tree is optimal when the transfer cost is equal to the computation cost. Finally, approximation ratios of strategies based on those trees are derived.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134536406","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A self-tuning system based on application Profiling and Performance Analysis for optimizing Hadoop MapReduce cluster configuration","authors":"Dili Wu, A. Gokhale","doi":"10.1109/HiPC.2013.6799133","DOIUrl":"https://doi.org/10.1109/HiPC.2013.6799133","url":null,"abstract":"One of the most widely used frameworks for programming MapReduce-based applications is Apache Hadoop. Despite its popularity, however, application developers face numerous challenges in using the Hadoop framework, which stem from them having to effectively manage the resources of a MapReduce cluster, and configuring the framework in a way that will optimize the performance and reliability of MapReduce applications running on it. This paper addresses these problems by presenting the Profiling and Performance Analysis-based System (PPABS) framework, which automates the tuning of Hadoop configuration settings based on deduced application performance requirements. The PPABS framework comprises two distinct phases called the Analyzer, which trains PPABS to form a set of equivalence classes of MapReduce applications for which the most appropriate Hadoop config- uration parameters that maximally improve performance for that class are determined, and the Recognizer, which classifies an incoming unknown job to one of these equivalence classes so that its Hadoop configuration parameters can be self-tuned. The key research contributions in the Analyzer phase includes modifications to the well-known k - means + + clustering and Simulated Annealing algorithms, which were required to adapt them to the MapReduce paradigm. The key contributions in the Recognizer phase includes an approach to classify an unknown, incoming job to one of the equivalence classes and a control strategy to self-tune the Hadoop cluster configuration parameters for that job. Experimental results comparing the performance improvements for three different classes of applications running on Hadoop clusters deployed on Amazon EC2 show promising results.","PeriodicalId":206307,"journal":{"name":"20th Annual International Conference on High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128674351","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}