Designing Network Failover and Recovery in MPI for Multi-Rail InfiniBand Clusters
S. Raikar, H. Subramoni, K. Kandalla, Jérôme Vienne, D. Panda
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), May 21, 2012. DOI: https://doi.org/10.1109/IPDPSW.2012.142

Abstract: The emerging trend of designing commodity-based supercomputing systems has a severe detrimental impact on the Mean Time Between Failures (MTBF). The MTBF for typical HEC installations is currently estimated to be between eight hours and fifteen days [1]. Failures in the interconnect fabric account for a fair share of the total failures occurring in such systems, and this will only worsen as system sizes grow. It is therefore highly desirable that next-generation system architectures and software environments provide sophisticated network-level fault-tolerance and fault-resilience solutions. In the past few years, the number of cores per processor has increased dramatically; to use such machines efficiently, the required bandwidth must be delivered to all cores. To keep up with the multi-core trend, current-generation supercomputers and clusters are designed with multiple network cards (rails) for enhanced data transfer capability. Besides improving performance, such multi-rail networks can also be leveraged for network-level fault resilience. This paper presents a failover design for the multi-rail scenario that handles network failures and their recovery without compromising performance. In a typical message-passing scenario, a network failure aborts the entire job; our design allows the job to continue by using the remaining rails for communication. We also propose a protocol that, once a rail recovers, re-establishes connections on it and resumes normal operation. We experimentally demonstrate that our implementation adds very little overhead and delivers performance comparable to that of the other rails running in isolation. Recovery is immediate and incurs no additional overhead. We further demonstrate the robustness of the design by running application benchmarks with permanent failures.
Business Process Oriented Platform-as-a-Service Framework for Process Instances Intensive Applications
Yongqing Zheng, Jinshan Pang, Jian Li, Li-zhen Cui
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), May 21, 2012. DOI: https://doi.org/10.1109/IPDPSW.2012.284

Abstract: As cloud computing grows increasingly popular in both commercial and academic settings, Platform-as-a-Service (PaaS) has become a core technology by which providers deliver services to both ordinary users and scientific organizations. This paper describes BPPaaS, a business-process-oriented Platform-as-a-Service framework comprising an integrated business process application programming model and business-process-oriented PaaS middleware. BPPaaS lets users submit business process logic written in the integrated business process programming language to the platform, which parses the source code, extracts the business process tasks and their relationships into metadata, and encodes the tasks as standalone executable components. Because each cloud data center holds specific data, BPPaaS assigns business process tasks to the data centers that hold the data those tasks require, using them as task execution nodes. A scheduling algorithm is introduced to support the execution of process-instance-intensive applications on multiple heterogeneous Java runtime environments serving as the underlying parallel computation platform. Finally, a case study from a social security application shows that the framework can streamline complex computational business processes.
{"title":"Analysis and Optimization of Data Import with Hadoop","authors":"Weijia Xu, Wei Luo, N. Woodward","doi":"10.1109/IPDPSW.2012.129","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.129","url":null,"abstract":"Data driven research has become an important part of scientific discovery in an increasing number of disciplines. In many cases, the sheer volume of data to be processed requires not only state-of-the-art computing resources but also carefully tuned and specifically developed software. These requirements are often associated with huge operational costs and significant expertise in software development. Due to its simplicity for the user and effectiveness at processing big data, Hadoop has become a popular software platform for large-scale data analysis. Using a Hadoop cluster in a remote shared infrastructure enables users to avoid the costs of maintaining a physical infrastructure. An inevitable step in using dynamically constructed Hadoop cluster is the initial importing of the data. This process is not trivial, particularly when the size of the data is large. In this paper, we evaluate the costs of importing large-scale data into a Hadoop cluster. We present a detailed analysis of the default data importing implementation in Hadoop and conduct a practical evaluation. Our evaluation includes tests with different hardware configurations, such as different network protocol and disk configurations. We also propose an implementation to improve the performance of importing data into a Hadoop cluster wherein the data is accessed directly by Data nodes during the import process.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126111188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Optimized Reduce for Mesh-Based NoC Multiprocessors","authors":"A. Kohler, M. Radetzki","doi":"10.1109/IPDPSW.2012.111","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.111","url":null,"abstract":"Future processors are expected to be made up of a large number of computation cores interconnected by fast on-chip networks (Network-on-Chip, NoC). Such distributed structures motivate the use of message passing programming models similar to MPI. Since the properties of these networks, like e.g. the topology, are known and fixed after production, this knowledge can be used to optimize the communication stack. We describe two schemes that take advantage of this to accelerate the (All-)Reduce operation defined in MPI, namely a contention avoiding rank-to-core mapping and a way of interleaving communication and computation by means of pipelining. Simulations show that the combination of both schemes can accelerate (All-)Reduce operations by more than 60%.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123569843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"On Dynamic Run-time Processor Pipeline Reconfiguration","authors":"Carsten Tradowsky, F. Thoma, M. Hübner, J. Becker","doi":"10.1109/IPDPSW.2012.53","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.53","url":null,"abstract":"Adaptation of hardware in relation to the requirements of a specific application is well known and investigated in the domain of Field Programmable Gate Arrays (FPGA) based reconfigurable system architectures. In these system approaches, a number of predefined blocks, mainly accelerators for processors, are loaded from an external storage and are transferred to the FPGA configuration memory in order to manipulate the on-chip functionality. A novel approach is to adapt the micro architecture of a processor in order to achieve a temporal application-specific behavior. In combination with the well known techniques of dynamic reconfiguration of a FPGA, novel degrees of freedom are available for an energy efficient run-time dynamic system approach. This paper presents one adaptation mechanism, in which the pipeline depth is adapted according to the control flow and data flow of an application. The concept and also the realization are described and evaluated in terms of efficiency with some benchmarks.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123775649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementation and Evaluation of Triple Precision BLAS Subroutines on GPUs","authors":"Daichi Mukunoki, D. Takahashi","doi":"10.1109/IPDPSW.2012.175","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.175","url":null,"abstract":"We implemented and evaluated the triple precision Basic Linear Algebra Subprograms (BLAS) subroutines, AXPY, GEMV and GEMM on a Tesla C2050. In this paper, we present a Double Single (D+S) type triple precision floating-point value format and operations. They are based on techniques similar to Double-Double (DD) type quadruple precision operations. On the GPU, the D+S-type operations are more costly than the DD-type operations in theory and in practice. Therefore, the triple precision GEMM, which is a compute-bound operation, is slower than the quadruple precision GEMM. However, the triple precision AXPY and GEMV are memory-bound operations on the GPU, thus their execution time of these triple precision subroutines is close to 3/4 of the quadruple precision subroutines. Therefore, we conclude that the triple precision value format is useful for memory-bound operations, in cases where the quadruple precision is not required, but double precision is not sufficient.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125300248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The Multi-Processor Scheduling Problem in Phylogenetics","authors":"Jiajie Zhang, A. Stamatakis","doi":"10.1109/IPDPSW.2012.86","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.86","url":null,"abstract":"Advances in wet-lab sequencing techniques allow for sequencing between 100 genomes up to 1000 full transcriptomes of species whose evolutionary relationships shall be disentangled by means of phylogenetic analyses. Likelihood-based evolutionary models allow for partitioning such broad phylogenomic datasets, for instance into gene regions, for which likelihood model parameters (except for the tree itself) can be estimated independently. Present day phylogenomic datasets are typically split up into 1000-10,000 distinct partitions. While the likelihood on such datasets needs to be computed in parallel because of the high memory requirements, it has not yet been assessed how to optimally distribute partitions and/or alignment sites to processors, in particular when the number of cores is significantly smaller than the number of partitions. We find that, by distributing partitions (of varying lengths) monolithically to processors, the induced load distribution problem essentially corresponds to the well-known multiprocessor scheduling problem. By implementing the simple Longest Processing Time (LPT) heuristics in the PThreads and MPI version of RAxML-Light, we were able to accelerate run times by up to one order of magnitude. Other heuristics for multi-processor scheduling such as improved MultiFit, improved Zero-One, or the Three Phase approach did not yield notable performance improvements.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126641254","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Effective Self-adaptive Load Balancing Algorithm for Peer-to-Peer Networks
N. Xiong, Kaihua Xu, Lilong Chen, L. Yang, Yuhua Liu
2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum (IPDPSW), May 21, 2012. DOI: https://doi.org/10.1109/IPDPSW.2012.179

Abstract: The field of parallel and distributed computing has become increasingly significant with recent advances in electronic and integrated circuit technologies. Peer-to-Peer (P2P) cloud computing networks are the largest contributor of network traffic on the Internet. Measurement plays an important role in many P2P applications, and measurement-based optimization of P2P networking and applications should be strengthened. In particular, extensive schemes have been proposed to improve file sharing efficiency in P2P networks while reducing inter-domain traffic, and file sharing has become a serious concern. However, differences in node capability, free-riding behavior, and high churn cause great load imbalance among high-speed network nodes. This paper presents a self-adaptive load balancing algorithm in which nodes automatically build binary-tree backup-node tables for their shared hot files and transfer excess query connections, originally sent to heavily loaded nodes, to backup nodes. The experimental results show that our algorithm reduces the load on heavily loaded nodes and achieves good balance among high-speed network nodes; even under high churn it retains its balancing effect and lowers the load of the whole network.
{"title":"A QoS-Aware Service Selection Method for Cloud Service Composition","authors":"Huihui Bao, Wanchun Dou","doi":"10.1109/IPDPSW.2012.278","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.278","url":null,"abstract":"Many recent studies have been addressing the service selection problem based on non-functional aspects due to the ever-increasing number of web services. However, most existing works about QoS-based service composition treat the services referred in service composition as independent ones from each other, and their correlations are usually ignored. In reality, the services supplied by service providers in cloud environment are not segregate and irrelevant with each other. In view of this challenging problem, we use Finite State Machine (FSM) to prescribe the legal invocation orders of these web services, also an improved Tree-pruning-based algorithm is proposed to create the Web Service Composition Tree (WSCT). After generating all of the feasible execution paths, a Simple Additive Weighting (SAW) technique is used to select an optimal one. At last, an experiment is presented for validating the performance of the method.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"285 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116107472","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Benefits of Heterogeneous Computing in HPC Workloads","authors":"V. Lee, Edward T. Grochowski, Robert Y. Geva","doi":"10.1109/IPDPSW.2012.18","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.18","url":null,"abstract":"Chip multi-processors (CMPs) with increasing number of processor cores are now becoming widely available. To take advantage of many-core CMPs, applications must be parallelized. However, due to the nature of algorithm/programming model, some parts of the application would remain serial. According to Amdahl's law, the speedup of a parallel application is limited by the amount of serial execution it has. For a CMP with many cores, this can be a serious limitation. To take full advantage of the increasing number of cores, one must try to reduce the execution time of the serial portion of a parallel program. However, rewriting an application takes time and often the return on the effort invested may not justify parallelizing every part of the program. Heterogeneous many-core CMP design is one possible solution to support massive parallel execution and to provide a reasonable single-thread performance. In this paper, we use a simple spreadsheet model to evaluate homogeneous and heterogeneous CMP designs using execution profiles of real HPC applications. Evaluated on 12 parallel HPC applications, we show that heterogeneous CMPs can outperform homogeneous CMPs by up to 1.35× with an average speedup of 1.06× when both the heterogeneous CMPs and homogeneous CMPs are constrained to use the same power budget. Our study found the heterogeneous CMPs can take advantage of serial portion of execution that is as little as 2% of total run time to provide performance benefit. This suggests heterogeneous computing can help mitigate the effect of not parallelizing some portions of an application due to return on investment concern on programming efforts.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"269 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122933560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}