{"title":"BatchQueue: Fast and Memory-Thrifty Core to Core Communication","authors":"Thomas Preud'homme, Julien Sopena, Gaël Thomas, B. Folliot","doi":"10.1109/SBAC-PAD.2010.34","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.34","url":null,"abstract":"Sequential applications can take advantage of multi-core systems by way of pipeline parallelism to improve their performance. In such parallelism, core to core communication overhead is the main limit on speedup. This paper presents BatchQueue, a fast and memory-thrifty core to core communication system based on batch processing of whole cache lines. BatchQueue is able to send a 32-bit word of data in just 12.5 ns on a Xeon X5472 and only needs 2 full cache lines plus 3 byte-sized variables (each on a different cache line for optimal performance) to work. The characteristics of BatchQueue (high throughput and increased latency resulting from its batch processing) make it well suited for highly communicative tasks with no real-time requirements, such as monitoring.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115392317","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An Analytical Model on the Execution of Transactional Memory","authors":"Xiao Yu, Zhengyu He, Bo Hong","doi":"10.1109/SBAC-PAD.2010.29","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.29","url":null,"abstract":"In this paper, we develop an analytical model of the execution of transactional memory (TM) systems. This model employs queuing theory to analyze the impact of an essential set of TM design parameters, including the conflict rate, the number of checkpoints, and the implementation overhead. The model is validated via extensive experiments. To demonstrate the effectiveness of the model, we further study the performance impact of two factors. Our study shows that, for a given TM-based program, the checkpointing frequency can be carefully chosen to minimize the mean transaction completion time. Our study also demonstrates the importance of reducing implementation overhead.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115687811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Achieving Fault Tolerance on Grids with the CPPC Framework and the GridWay Metascheduler","authors":"Iván Cores, Gabriel Rodríguez, María J. Martín, P. González","doi":"10.1109/SBAC-PAD.2010.22","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.22","url":null,"abstract":"Grids have brought a significant increase in the number of available resources that can be provided to applications. In the last decade, an important effort has been made to develop middleware that provides grids with functionalities related to application execution. However, support for fault-tolerant executions is either lacking or limited. This paper presents an experience in adding fault tolerance support to parallel executions on grids through the integration of CPPC, a checkpointing tool for parallel applications, and GridWay, a well-known metascheduler provided with the Globus Toolkit. Since the two tools are not immediately compatible, a new architecture, called CPPC-GW, has been designed and implemented to allow for the transparent execution of CPPC applications through GridWay. The performance of the solution has been evaluated using the NAS Parallel Benchmarks. Detailed experimental results show the low overhead of the approach.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132169949","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Analyzing Cache Coherence Protocols for Server Consolidation","authors":"Antonio García-Guirado, Ricardo Fernández Pascual, José M. García","doi":"10.1109/SBAC-PAD.2010.31","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.31","url":null,"abstract":"Server consolidation is commonly used today to make the most out of all the cores of a chip multiprocessor by running several virtual machines (VMs) on it. Cache coherence protocols can be adapted to take advantage of such a scenario. In this line, Virtual Hierarchies (VHs) use two levels of cache coherence in a consolidated server. They isolate the coherence actions of each VM and improve performance by maximizing the number of memory accesses serviced by caches within the VM. In this paper we show how hierarchical protocols with no single ordering point for the requests, such as VHs in the form currently proposed, are prone to deadlocks. Besides, when memory deduplication is used, VHs cannot take advantage of it at the cache level, both because deduplicated data is reduplicated in cache, and because accesses to deduplicated data often require access, by means of broadcast, to the cache tiles used by a different VM. We analyze all these problems and propose solutions for them, showing the actual performance of these protocols and giving some insights for the future development of coherence protocols optimized for server consolidation.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130850618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Tree Projection-Based Frequent Itemset Mining on Multicore CPUs and GPUs","authors":"George Teodoro, Nathan Mariano, Wagner Meira Jr, R. Ferreira","doi":"10.1109/SBAC-PAD.2010.15","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.15","url":null,"abstract":"Frequent itemset mining (FIM) is a core operation for several data mining applications, such as association rule computation, correlations, document classification, and many others, and has been extensively studied over the last decades. Moreover, databases are becoming increasingly larger, thus requiring more computing power to mine them in reasonable time. At the same time, advances in high performance computing platforms are transforming them into hierarchical parallel environments equipped with multi-core processors and many-core accelerators, such as GPUs. Thus, fully exploiting these systems to perform FIM tasks poses a challenging and critical problem that we address in this paper. We present efficient multi-core and GPU accelerated parallelizations of Tree Projection, one of the most competitive FIM algorithms. The experimental results show that our Tree Projection implementation scales almost linearly in a CPU shared-memory environment after careful optimizations, while the GPU versions are up to 173 times faster than the standard CPU version.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"28 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128217023","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Simultaneous Evaluation of Multiple I/O Strategies","authors":"Pilar González-Férez, J. Piernas, Toni Cortes","doi":"10.1109/SBAC-PAD.2010.30","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.30","url":null,"abstract":"We present a framework for simulating the performance obtained by different I/O system mechanisms and algorithms at the same time, and for dynamically turning them on and off to improve the overall system performance. A key element of this framework is the design and implementation of a virtual disk inside the Linux kernel. Our virtual disk creates a virtual block device which is able to simulate any hard drive with a negligible overhead, without interfering with regular I/O requests. We describe the potential of our proposal in REDCAP, a RAM-based disk cache which is dynamically activated/deactivated according to the throughput achieved. The results show that, by using our virtual disk, REDCAP obtains its maximum possible improvements: up to 80% for workloads with some spatial locality, and the same performance as a \"normal system\" for workloads with random or large sequential reads.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128938774","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards a Peer-to-Peer Framework for Parallel and Distributed Computing","authors":"L. José, Senger Márcio Augusto de Souza, D. Foltran","doi":"10.1109/SBAC-PAD.2010.23","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.23","url":null,"abstract":"This paper presents a framework for developing and executing parallel and distributed applications using the peer-to-peer computing model. The framework, called P2PComp, follows the main philosophy of the pure peer-to-peer model: there is no hierarchy among the peers, all peers have the same functions, and there is no central authority server responsible for the system organization. SPMD parallel applications can be implemented by extending the framework functionalities, which include functions for starting and monitoring processes, searching for resources, and communicating by message passing. This paper presents a detailed description of the framework and examples of its use for building and executing parallel applications. The results obtained show that the framework can be effectively used for executing computational programs in a flexible peer-to-peer environment.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128858012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Performance Issues for Parallel Implementations of Bootstrap Simulation Algorithm","authors":"R. Czekster, Paulo Fernandes, Afonso Sales, T. Webber","doi":"10.1109/SBAC-PAD.2010.28","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.28","url":null,"abstract":"The solution of state-based stochastic models is usually computationally demanding, so it is a natural candidate for high-performance techniques. We are particularly interested in the speedup of Bootstrap Simulation of structured Markovian models. This approach is a quite recent development in the performance evaluation area, and it brings a considerable improvement in the accuracy of the results, despite the intrinsic effect of randomness in simulation experiments. Unfortunately, Bootstrap Simulation has a higher computational cost than other alternatives. We present experiments with different options to optimize the parallel solution of Bootstrap Simulation applied to three practical examples described in the Stochastic Automata Networks (SAN) formalism. The contribution of this paper lies in the discussion of theoretical implementation issues, the obtained speedup, and the actual processing and communication times for all experiments. Additionally, we suggest future work to further improve the proposed solution and discuss some interesting insights for the parallelization of similar applications.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130848021","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Clock Synchronization Strategy for Minimizing Clock Variance at Runtime in High-End Computing Environments","authors":"T. Jones, G. Koenig","doi":"10.1109/SBAC-PAD.2010.33","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2010.33","url":null,"abstract":"We present a new software-based clock synchronization scheme that provides high precision time agreement among distributed memory nodes. The technique is designed to minimize variance from a reference chimer during runtime and with minimal time-request latency. Our scheme permits initial unbounded variations in time and corrects both slow and fast chimers (clock skew). An implementation developed within the context of the Message Passing Interface (MPI) is described and time coordination measurements are presented. Among our results, the mean time variance among a set of nodes improved from 20.0 milliseconds under standard Network Time Protocol (NTP) to 2.29 microseconds under our scheme.","PeriodicalId":432670,"journal":{"name":"2010 22nd International Symposium on Computer Architecture and High Performance Computing","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-10-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115573538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}