2011 IEEE International Parallel & Distributed Processing Symposium最新文献

Power and Performance Management in Priority-Type Cluster Computing Systems 优先级型集群计算系统的电源和性能管理

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.13

Kaiqi Xiong

{"title":"Power and Performance Management in Priority-Type Cluster Computing Systems","authors":"Kaiqi Xiong","doi":"10.1109/IPDPS.2011.13","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.13","url":null,"abstract":"Cluster computing not only improves performance but also increase power consumption. %over that of a single computer. It is a challenge to increase the performance of a cluster computing system and reduce its power consumption simultaneously. In this paper, we consider a collection of cluster computing resources owned by a service provider to host an enterprise application for multiple class business customers where customer requests are distinguished, with different request characteristics and service requirements. We start with a development of computing an average end-to-end delay and an average energy consumption for multiple class customers in such an application. Then, we present approaches for optimizing the average end-to-end delay subject to the constraint of an average energy consumption and optimizing the average end-to-end energy consumption subject to the constraints of an average end-to-end delay for all class and each class customer requests respectively. Moreover, a service provider processes the service requests of customers according to a service level agreement (SLA), which is a contract agreed between a customer and a service provider. It becomes important and commonplace to prioritize multiple customer services in favor of customers who are willing to pay higher fees. We propose an approach for minimizing the total cost of cluster computing resources allocated to ensure multiple priority customer service guarantees by the service provider. It is demonstrated through our simulation that the proposed approaches are efficient and accurate for power management and performance guarantees in priority-type cluster computing systems.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115407499","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 1

Fast Community Detection Algorithm with GPUs and Multicore Architectures 基于gpu和多核架构的快速社区检测算法

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.61

Jyothish Soman, A. Narang

{"title":"Fast Community Detection Algorithm with GPUs and Multicore Architectures","authors":"Jyothish Soman, A. Narang","doi":"10.1109/IPDPS.2011.61","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.61","url":null,"abstract":"In this paper, we present the design of a novel scalable parallel algorithm for community detection optimized for multi-core and GPU architectures. Our algorithm is based on label propagation, which works solely on local information, thus giving it the scalability advantage over conventional approaches. We also show that weighted label propagation can overcome typical quality issues in communities detected with label propagation. Experimental results on well known massive scale graphs such as Wikipedia (100M edges) and also on RMAT graphs with 10M - 40M edges, demonstrate the superior performance and scalability of our algorithm compared to the well known approaches for community detection. On the textit{hep-th} graph ($352$K edges) and the textit{wikipedia} graph ($100$M edges), using Power 6 architecture with $32$ cores, our algorithm achieves one to two orders of magnitude better performance compared to the best known prior results on parallel architectures with similar number of CPUs. Further, our GPGPU based algorithm achieves $8times$ improvement over the Power 6 performance on $40$M edge R-MAT graph. Alongside, we achieve high quality (modularity) of communities detected, with experimental evidence from well-known graphs such as Zachary karate club, Dolphin network and Football club, where we achieve modularity that is close to the best known alternatives. To the best of our knowledge these are best known results for community detection on massive graphs ($100$M edges) in terms of performance and also quality vs. performance trade-off. This is also a unique work on community detection on GPGPUs with scalable performance.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123086666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 73

Iso-Energy-Efficiency: An Approach to Power-Constrained Parallel Computation 等能源效率:一种功率受限的并行计算方法

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.22

S. Song, Chun-Yi Su, Rong Ge, Abhinav Vishnu, K. Cameron

{"title":"Iso-Energy-Efficiency: An Approach to Power-Constrained Parallel Computation","authors":"S. Song, Chun-Yi Su, Rong Ge, Abhinav Vishnu, K. Cameron","doi":"10.1109/IPDPS.2011.22","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.22","url":null,"abstract":"Future large scale high performance supercomputer systems require high energy efficiency to achieve exaflops computational power and beyond. Despite the need to understand energy efficiency in high-performance systems, there are few techniques to evaluate energy efficiency at scale. In this paper, we propose a system-level iso-energy-efficiency model to analyze, evaluate and predict energy-performance of data intensive parallel applications with various execution patterns running on large scale power-aware clusters. Our analytical model can help users explore the effects of machine and application dependent characteristics on system energy efficiency and isolate efficient ways to scale system parameters (e.g. processor count, CPU power/frequency, workload size and network bandwidth) to balance energy use and performance. We derive our iso-energy-efficiency model and apply it to the NAS Parallel Benchmarks on two power-aware clusters. Our results indicate that the model accurately predicts total system energy consumption within 5% error on average for parallel applications with various execution and communication patterns. We demonstrate effective use of the model for various application contexts and in scalability decision-making.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126856307","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 54

Architectural Constraints to Attain 1 Exaflop/s for Three Scientific Application Classes 三个科学应用类达到1 Exaflop/s的架构限制

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.18

A. Bhatele, Pritish Jetley, Hormozd Gahvari, Lukasz Wesolowski, W. Gropp, L. Kalé

{"title":"Architectural Constraints to Attain 1 Exaflop/s for Three Scientific Application Classes","authors":"A. Bhatele, Pritish Jetley, Hormozd Gahvari, Lukasz Wesolowski, W. Gropp, L. Kalé","doi":"10.1109/IPDPS.2011.18","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.18","url":null,"abstract":"The first Teraflop/s computer, the ASCI Red, became operational in 1997, and it took more than 11 years for a Petaflop/s performance machine, the IBM Roadrunner, to appear on the Top500 list. Efforts have begun to study the hardware and software challenges for building an exascale machine. It is important to understand and meet these challenges in order to attain Exaflop/s performance. This paper presents a feasibility study of three important application classes to formulate the constraints that these classes will impose on the machine architecture for achieving a sustained performance of 1 Exaflop/s. The application classes being considered in this paper are -- classical molecular dynamics, cosmological simulations and unstructured grid computations (finite element solvers). We analyze the problem sizes required for representative algorithms in each class to achieve 1 Exaflop/s and the hardware requirements in terms of the network and memory. Based on the analysis for achieving an Exaflop/s, we also discuss the performance of these algorithms for much smaller problem sizes.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"376 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122452720","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 28

Efficient Parallel Scheduling of Malleable Tasks 可延展任务的高效并行调度

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.110

P. Sanders, Jochen Speck

引用次数: 14

The Evaluation of an Effective Out-of-Core Run-Time System in the Context of Parallel Mesh Generation 并行网格生成环境下有效的离核运行时系统评价

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.25

A. Kot, Andrey N. Chernikov, N. Chrisochoides

{"title":"The Evaluation of an Effective Out-of-Core Run-Time System in the Context of Parallel Mesh Generation","authors":"A. Kot, Andrey N. Chernikov, N. Chrisochoides","doi":"10.1109/IPDPS.2011.25","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.25","url":null,"abstract":"We present an out-of-core run-time system that supports effective parallel computation of large irregular and adaptive problems, in particular unstructured mesh generation (PUMG). PUMG is a highly challenging application due to intensive memory accesses, unpredictable communication patterns, and variable and irregular data dependencies reflecting the unstructured spatial connectivity of mesh elements. Our runtime system allows to transform the footprint of parallel applications from wide and shallow into narrow and deep by extending the memory utilization to the out-of-core level. It simplifies and streamlines the development of otherwise highly time consuming out-of-core applications as well as the converting of existing applications. It utilizes disk, network and memory hierarchy to achieve high utilization of computing resources without sacrificing performance with PUMG. The runtime system combines different programming paradigms: multi-threading within the nodes using industrial strength software framework, one-sided active messages among the nodes, and an out-of-core subsystem for managing large datasets. We performed an evaluation on traditional parallel platforms to stress test all layers of the run-time system using three different PUMG methods with significantly varying communication and synchronization patterns. We demonstrated high overlap in computation, communication, and disk I/O which results in good performance when computing large out-of-core problems. The runtime system adds very small overhead~(up to 18% on most configurations) when computing in-core which means performance is not compromised.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128786190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 8

A Novel Power Management for CMP Systems in Data-Intensive Environment 数据密集型环境下CMP系统的一种新型电源管理方法

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.19

Pengju Shang, Jun Wang

{"title":"A Novel Power Management for CMP Systems in Data-Intensive Environment","authors":"Pengju Shang, Jun Wang","doi":"10.1109/IPDPS.2011.19","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.19","url":null,"abstract":"The emerging data-intensive applications of today are comprised of non-uniform CPU and I/O intensive workloads, thus imposing a requirement to consider both CPU and I/O effects in the power management strategies. Only scaling down the processor's frequency based on its busy/idle ratio cannot fully exploit opportunities of saving power. Our experiments show that besides the busy and idle status, each processor may also have I/O wait phases waiting for I/O operations to complete. During this period, the completion time is decided by the I/O subsystem rather than the CPU thus scaling the processor to a lower frequency will not affect the performance but save more power. In addition, the CPU's reaction to the I/O operations may be significantly affected by several factors, such as I/O type (sync or unsync), instruction/job level parallelism, it cannot be accurately modeled via physics laws like mechanical or chemical systems. In this paper, we propose a novel power management scheme called MAR (modeless, adaptive, rule-based) in multiprocessor systems to minimize the CPU power consumption under performance constraints. By using richer feedback factors, e.g. the I/O wait, MAR is able to accurately describe the relationships among core frequencies, performance and power consumption. We adopt a modeless control model to reduce the complexity of system modeling. MAR is designed for CMP (Chip Multi Processor) systems by employing multi-input/multi-output (MIMO) theory and per core level DVFS (Dynamic Voltage and Frequency Scaling). Our extensive experiments on a physical test bed demonstrate that, for the SPEC benchmark and data-intensive (TPC-C) benchmark, the efficiency of MAR is 93.6-96.2% accurate to the ideal power saving strategy calculated off-line. Compared with baseline solutions, MAR could save 22.5-32.5% more power while keeping the comparable performance loss of about 1.8-2.9%. In addition, simulation results show the efficiency of our design for various CMP configurations.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128240602","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 12

Parallel Metagenomic Sequence Clustering Via Sketching and Maximal Quasi-clique Enumeration on Map-Reduce Clouds 基于Map-Reduce云上草图绘制和最大拟团枚举的并行宏基因组序列聚类

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.116

X. Yang, J. Zola, S. Aluru

{"title":"Parallel Metagenomic Sequence Clustering Via Sketching and Maximal Quasi-clique Enumeration on Map-Reduce Clouds","authors":"X. Yang, J. Zola, S. Aluru","doi":"10.1109/IPDPS.2011.116","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.116","url":null,"abstract":"Taxonomic clustering of species is an important and frequently arising problem in metagenomics. High-throughput next generation sequencing is facilitating the creation of large metagenomic samples, while at the same time making the clustering problem harder due to the short sequence length supported and unknown species sampled. In this paper, we present a parallel algorithm for hierarchical taxonomic clustering of large metagenomic samples with support for overlapping clusters. We adapt the sketching techniques originally developed for web document clustering to deduce significant similarities between pairs of sequences without resorting to expensive all vs. all alignments. We formulate the metagenomics classification problem as that of maximal quasi-clique enumeration in the resulting similarity graph, at multiple levels of the hierarchy as prescribed by different similarity thresholds. We cast execution of the underlying algorithmic steps as applications of the map-reduce framework to achieve a cloud based implementation. Apart from solving an important problem in metagenomics, this work demonstrates the applicability of map-reduce framework in relatively complicated algorithmic settings.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"49 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129155717","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 23

Tolerant Value Speculation in Coarse-Grain Streaming Computations 粗粒度流计算中的容值推测

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.54

Nathaniel Azuelos, I. Keidar, A. Zaks

引用次数: 1

LACIO: A New Collective I/O Strategy for Parallel I/O Systems LACIO:并行I/O系统的一种新的集体I/O策略

2011 IEEE International Parallel & Distributed Processing Symposium Pub Date : 2011-05-16 DOI: 10.1109/IPDPS.2011.79

Yong Chen, Xian-He Sun, R. Thakur, P. Roth, W. Gropp

{"title":"LACIO: A New Collective I/O Strategy for Parallel I/O Systems","authors":"Yong Chen, Xian-He Sun, R. Thakur, P. Roth, W. Gropp","doi":"10.1109/IPDPS.2011.79","DOIUrl":"https://doi.org/10.1109/IPDPS.2011.79","url":null,"abstract":"Parallel applications benefit considerably from the rapid advance of processor architectures and the available massive computational capability, but their performance suffers from large latency of I/O accesses. The poor I/O performance has been attributed as a critical cause of the low sustained performance of parallel systems. Collective I/O is widely considered a critical solution that exploits the correlation among I/O accesses from multiple processes of a parallel application and optimizes the I/O performance. However, the conventional collective I/O strategy makes the optimization decision based on the logical file layout to avoid multiple file system calls and does not take the physical data layout into consideration. On the other hand, the physical data layout in fact decides the actual I/O access locality and concurrency. In this study, we propose a new collective I/O strategy that is aware of the underlying physical data layout. We confirm that the new Layout-Aware Collective I/O (LACIO) improves the performance of current parallel I/O systems effectively with the help of noncontiguous file system calls. It holds promise in improving the I/O performance for parallel systems.","PeriodicalId":355100,"journal":{"name":"2011 IEEE International Parallel & Distributed Processing Symposium","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121294421","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

引用次数: 41