{"title":"Balancing HPC applications through smart allocation of resources in MT processors","authors":"C. Boneti, R. Gioiosa, F. Cazorla, J. Corbalán, Jesús Labarta, M. Valero","doi":"10.1109/IPDPS.2008.4536293","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536293","url":null,"abstract":"Many studies have shown that load imbalance causes significant performance degradation in high performance computing (HPC) applications. Nowadays, multi-threaded (MT) processors are widely used in HPC for their good performance/energy consumption and performance/cost ratios, achieved by sharing internal resources such as the instruction window or the physical registers. Some of these processors provide software with hardware mechanisms for controlling the allocation of the processor's internal resources. In this paper, we show, for the first time, that by appropriately using these mechanisms, we are able to control the speed of tasks, reducing the imbalance in parallel applications transparently to the user and, hence, reducing the total execution time. Our results show that our proposal leads to a performance improvement of up to 18% for one of the NAS benchmarks. For a real HPC application (much more dynamic than the benchmark) the performance improvement is 8.1%. Our results also show that, if resource allocation is not used properly, the imbalance of applications is worsened, causing performance loss.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127661308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The impact of out-of-order commit in coarse-grain, fine-grain and simultaneous multithreaded architectures","authors":"R. Ubal, J. Sahuquillo, S. Petit, P. López, J. Duato","doi":"10.1109/IPDPS.2008.4536284","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536284","url":null,"abstract":"Multithreaded processors in their different organizations (simultaneous, coarse grain and fine grain) have been shown to be effective architectures for reducing issue waste. On the other hand, retiring instructions from the pipeline in an out-of-order fashion helps to unclog the ROB when a long latency instruction reaches its head. This further contributes to maintaining a higher utilization of the available issue bandwidth. In this paper, we evaluate the impact of retiring instructions out of order on different multithreaded architectures and different instruction fetch policies, using the recently proposed Validation Buffer microarchitecture as the baseline out-of-order commit technique. Experimental results show that, for the same performance, out-of-order commit makes it possible to reduce multithread hardware complexity (e.g., fine grain multithreading with a lower number of supported threads).","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131302920","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Monitoring for multi-middleware grid","authors":"G. Poghosyan, M. Kunze","doi":"10.1109/IPDPS.2008.4536209","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536209","url":null,"abstract":"Within the framework of the German Grid Computing Initiative (D-Grid), we study the monitoring systems and software suites that are used to collect information from computational grids working with single or multiple middleware systems. Based on these investigations, we build prototypes of monitoring systems and implement them in the D-Grid infrastructure. A Site Check Center (SCC) concept is suggested to provide a unified interface for access to data from different test-benchmark systems working with more than one middleware. A vertical hierarchical architecture for exchanging information and building the network of monitoring systems is suggested and employed. A concept for separating consumer-related from resource/service-provider-related monitoring information is proposed. Furthermore, we study the integration of monitoring components into a general computational multi-middleware grid infrastructure developed according to specific community needs.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133802672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A survey of concurrent priority queue algorithms","authors":"Kristijan Dragicevic, D. Bauer","doi":"10.1109/IPDPS.2008.4536331","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536331","url":null,"abstract":"Algorithms for concurrent data structures have gained attention in recent years as multi-core processors have become ubiquitous. Using the example of a concurrent priority queue, this paper investigates different synchronization methods and concurrent algorithms. It covers traditional lock-based approaches, non-blocking algorithms as well as a method based on software transactional memory. Besides discussing correctness criteria for the various approaches, we also present performance results for all algorithms for various scenarios. Somewhat surprisingly, we find that a simple lock-based approach performs reasonably well, even though it does not scale with the number of threads. Better scalability is achieved by non-blocking approaches.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115434348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel, scalable, memory-efficient backtracking for combinatorial modeling of large-scale biological systems","authors":"Byung-Hoon Park, Matthew C. Schmidt, K. Thomas, T. Karpinets, N. Samatova","doi":"10.1109/IPDPS.2008.4536180","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536180","url":null,"abstract":"Data-driven modeling of biological systems such as protein-protein interaction networks is data-intensive and combinatorially challenging. Backtracking can constrain a combinatorial search space. Yet, its recursive nature, exacerbated by data-intensity, limits its applicability for large-scale systems. Parallel, scalable, and memory-efficient backtracking is a promising approach. Parallel backtracking suffers from unbalanced loads. Load rebalancing via synchronization and data movement is prohibitively expensive. Balancing these discrepancies, while minimizing end-to-end execution time and memory requirements, is desirable. This paper introduces such a framework. Its scalability and efficiency, demonstrated on the maximal clique enumeration problem, are attributed to the proposed: (a) representation of search tree decomposition to enable parallelization; (b) depth-first parallel search to minimize memory requirement; (c) least stringent synchronization to minimize data movement; and (d) on-demand work stealing with stack splitting to minimize processors' idle time. The applications of this framework to real biological problems related to bioethanol production are discussed.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115493643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Scheduling divisible workloads on heterogeneous platforms under bounded multi-port model","authors":"Olivier Beaumont, N. Bonichon, Lionel Eyraud-Dubois","doi":"10.1109/IPDPS.2008.4536170","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536170","url":null,"abstract":"In this paper, we discuss complexity issues for scheduling divisible workloads on heterogeneous systems under the bounded multi-port model. To the best of our knowledge, this paper is the first attempt to consider divisible load scheduling under a realistic communication model, where the master node can communicate simultaneously with several slaves, provided that bandwidth constraints are not exceeded. In this paper, we concentrate on one-round distribution schemes, where a given node starts its processing only once all data has been received. Our main contributions are (i) the proof that processors start working immediately after receiving their work, (ii) the study of the optimal schedule in the case of 2 processors, and (iii) the proof that scheduling divisible load under the bounded multi-port model is NP-complete. This last result strongly differs from divisible load literature and represents the first NP-completeness result when latencies are not taken into account.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124519391","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Designing and parameterizing a workflow for optimization: A case study in biomedical imaging","authors":"Vijay S. Kumar, Mary W. Hall, J. Kim, Y. Gil, T. Kurç, E. Deelman, V. Ratnakar, J. Saltz","doi":"10.1109/IPDPS.2008.4536411","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536411","url":null,"abstract":"This paper describes our experience to date employing the systematic mapping and optimization of large-scale scientific application workflows to current and future parallel platforms. The overall goal of the project is to integrate a set of system layers - application program, compiler, run-time environment, knowledge representation, optimization framework, and workflow manager - and through a systematic strategy for workflow mapping, our approach will exploit the vast machine resources available in such parallel platforms to dramatically increase the productivity of application programmers. In this paper, we describe the representation of a biomedical imaging application as a workflow, our early experiences in integrating the set of tools brought together for this project, and implications for future applications.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"55 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115044363","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Synchronized send operations for efficient streaming block I/O over Myrinet","authors":"Evangelos Koukis, Anastassios Nanos, N. Koziris","doi":"10.1109/IPDPS.2008.4536142","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536142","url":null,"abstract":"Providing scalable clustered storage in a cost-effective way depends on the availability of an efficient network block device (nbd) layer. We study the performance of gmblock, an nbd server over Myrinet utilizing a direct disk-to-NIC data path which bypasses the CPU and main memory bus. To overcome the architectural limitation of a low number of outstanding requests, we focus on overlapping read and network I/O for a single request, in order to improve throughput. To this end, we introduce the concept of synchronized send operations and present an implementation on Myrinet/GM, based on custom modifications to the NIC firmware and associated userspace library. Compared to a network block sharing system over standard GM and the base version of gmblock, our enhanced implementation supporting synchronized sends delivers 81% and 44% higher throughput for streaming block I/O, respectively.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116945088","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A predicate-based approach to dynamic protocol update in group communication","authors":"Olivier Rütti, A. Schiper","doi":"10.1109/IPDPS.2008.4536238","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536238","url":null,"abstract":"In this paper we study dynamic protocol updates (DPU), which consist in replacing, without interruption, a given protocol during execution. We focus especially on group communication protocols. The paper proposes a methodology to conveniently describe which protocols are correctly replaced by a given DPU algorithm. More precisely, our methodology characterizes DPU algorithms by a set of inference rules. To validate our approach, we illustrate our methodology with a new DPU algorithm.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116963286","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Receiver-initiated message passing over RDMA Networks","authors":"S. Pakin","doi":"10.1109/IPDPS.2008.4536262","DOIUrl":"https://doi.org/10.1109/IPDPS.2008.4536262","url":null,"abstract":"Providing point-to-point message-passing semantics atop Put/Get hardware traditionally involves implementing a protocol comprising three network latencies. In this paper, we analyze the performance of an alternative implementation approach - receiver-initiated message passing - that eliminates one of the three network latencies. Performance measurements taken on the Cell Broadband Engine indicate that receiver-initiated message passing exhibits substantially lower latency than standard, sender-initiated message passing.","PeriodicalId":162608,"journal":{"name":"2008 IEEE International Symposium on Parallel and Distributed Processing","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2008-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116991669","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}