{"title":"Evaluation of Successive CPUs/APUs/GPUs Based on an OpenCL Finite Difference Stencil","authors":"H. Calandra, R. Dolbeau, P. Fortin, J. Lamotte, Issam Said","doi":"10.1109/PDP.2013.65","DOIUrl":"https://doi.org/10.1109/PDP.2013.65","url":null,"abstract":"The AMD APU (Accelerated Processing Unit) architecture, which combines CPU and GPU cores on the same die, is promising for GPU applications which performance is bottlenecked by the low PCI Express communication rate. However the first APU generations still have different CPU and GPU memory partitions. Currently, the APU integrated GPUs are also less powerful than discrete GPUs. In this paper we therefore investigate the interest of APUs for scientific computing by evaluating and comparing the performance of two successive AMD APUs (family codename Llano and Trinity), two successive discrete GPUs (chip codename Cayman and Tahiti) and one hexa-core AMD CPU. For this purpose, we rely on a 3D finite difference stencil, that is optimized and tuned in OpenCL. We detail the most interesting optimizations for each architecture and show very good performance in OpenCL: up to 500 Gflops on Tahiti. Finally, our results show that APU integrated GPUs outperform CPUs, and that integrated GPUs of upcoming APUs may match discrete GPUs for problems with high communication requirements.","PeriodicalId":202977,"journal":{"name":"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"99 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121117028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Block Level Storage Support for Open Source IaaS Clouds","authors":"S. Ács, M. Gergely, P. Kacsuk, M. Kozlovszky","doi":"10.1109/PDP.2013.45","DOIUrl":"https://doi.org/10.1109/PDP.2013.45","url":null,"abstract":"Cloud computing is the dominating paradigm in distributed computing. The most popular open source cloud solutions support different type of storage subsystems, because of the different needs of the deployed services (in terms of performance, flexibility, cost-effectiveness). In this paper, we investigate the supported standard and open source storage types and create a classification. We point out that the Internet Small Computer System Interface (iSCSI) based block level storage can be used for I/O intensive services currently. However, the ATA-over-Ethernet (AoE) protocol uses fewer layers and operates on lower level which makes it more lightweight and faster than iSCSI. Therefore, we proposed an architecture for AoE based storage support in OpenNebula cloud. The novel storage solution was implemented and the performance evaluation shows that the I/O throughput of the AoE based storage is better (32.5-61.5%) compared to the prior iSCSI based storage and the new storage solution needs less CPU time (41.37%) to provide the same services.","PeriodicalId":202977,"journal":{"name":"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126245935","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Concurrent Collections on Distributed Memory Theory Put into Practice","authors":"F. Schlimbach, James C. Brodman, K. Knobe","doi":"10.1109/PDP.2013.40","DOIUrl":"https://doi.org/10.1109/PDP.2013.40","url":null,"abstract":"Finding and expressing scalable parallelism is a non-trivial task, in fact it is one of the most difficult parts of software development. Concurrent Collections (CnC) is a novel programming model which aims to make this easy. Its higher level abstractions expose available parallelism implicitly through specifying the semantically required dependencies between individual computation kernels. It has been shown conceptually to be deterministic, independent of the target platform and to separate program semantics from tuning. While abstractly evident, there have been no concrete implementations yet which show that these concepts are actually generally exploitable in practice. We developed an implementation of CnC which exposes these benefits in a single model for both shared and distributed memory. Additionally, we provide a tuning interface which allows defining and optimizing distribution plans easily and flexibly. Unlike most approaches, our implementation allows changing the distribution without altering the computation code itself. This makes the development very productive because it separates the concerns of program semantics and tuning. Last but not least, we show that the new mechanisms not only preserve CnC's deterministic model but are also capable of providing competitive performance. We ported several applications and ran them on a cluster of multi-cores. Our results show that CnC performance matches and often outperforms that of existing state-of-the-art models.","PeriodicalId":202977,"journal":{"name":"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"128 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116318618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Fault Localizing End-to-End Flow Control Protocol for Networks-on-Chip","authors":"G. Schley, N. Batzolis, M. Radetzki","doi":"10.1109/PDP.2013.74","DOIUrl":"https://doi.org/10.1109/PDP.2013.74","url":null,"abstract":"A reliable data exchange between cores of a Network-on-Chip (NoC) is of great importance for correct system behavior. However, data exchange is aggravated by the occurrence of transient and permanent faults in the NoC's communication structure (links). These faults may cause corruption or loss of data which in turn may lead to performance degradation or, in worst case, to complete system failure. In case data is corrupted by a transient fault, a common measure to handle this is to retransmit the data. To ensure that faulty data is retransmitted, so called flow control protocols are applied. In case of permanent faults a simple retransmission is not possible. Permanent faults in e.g. links lead to a permanent corruption of data as long as they are not located. Thus, even retransmissions get corrupted. In this paper we present a fault tolerant end-to-end protocol applicable to arbitrary NoC topologies. It ensures reliable end-to-end communication in presence of transient and permanent faults in the interconnection structure. By means of the protocol's online diagnostic ability, it is capable of locating faulty links and switches without any additional diagnosis hardware.","PeriodicalId":202977,"journal":{"name":"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128531487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Paralysis: An Extensible Multi-tiered Guidance Environment for Program Parallelization and Analysis","authors":"Stuart McCool, Ran Shao, P. Milligan, F. Kurugollu","doi":"10.1109/PDP.2013.64","DOIUrl":"https://doi.org/10.1109/PDP.2013.64","url":null,"abstract":"The heterogeneous computing revolution continues unabated. Yet despite the vast number of naïve users in possession of bespoke software hoping to embrace the opportunities that this revolution has wrought, few approaches proposed in current literature can guide such users in these efforts. The most appropriate choice would appear to be a (semi-)automating compiler. However, these typically target a single device-type and demand the unguided use of directives. Consequently, they are of little use when naïve users are seeking answers to more fundamental questions, such as: which fragments of a program can/should be parallelized, which device should each fragment target, and what speedup will be attained. To this end, this paper expands on previous work and proposes Paralysis - an extensible guidance environment, tiered for varying programmer competencies with support for static and dynamic analysis techniques. At the highest level, guided user experiences are paramount. At the lowest level, underlying functionality is exposed as a set of plug-ins, ensuring longevity. A partial prototype, built atop the Cetus infrastructure, is described. It is used to analyze two serial programs for CUDA execution - the DFT and the Box Blur Filter. Speedups of 15x and 22x are achieved on the basis of the analysis.","PeriodicalId":202977,"journal":{"name":"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129402315","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Impact of Data Structure Layout on Performance","authors":"Nuno Faria, Rui C. Silva, J. Sobral","doi":"10.1109/PDP.2013.24","DOIUrl":"https://doi.org/10.1109/PDP.2013.24","url":null,"abstract":"One key issue to design parallel applications that scale on multicore systems is how to overcome the memory bottleneck. This paper presents a study of the impact of data structure layouts in locality of memory references, providing insights on strategies to ameliorate the memory bottleneck. The paper compares the performance of Java and C++ STL collections and presents the impact of locality of reference optimisations in a molecular dynamics simulation case study. The case study shows that the selected data structure layout has impact on single core performance, becoming a critical factor in the application scalability on multicore systems. Moreover, data collections provided in the Java language compromise performance due to pointer chasing and lack of spatial locality of memory references.","PeriodicalId":202977,"journal":{"name":"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122729817","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"ReStream - A Replication Algorithm for Reliable and Scalable Multimedia Streaming","authors":"Shabnam Ataee, B. Garbinato, F. Pedone","doi":"10.1109/PDP.2013.19","DOIUrl":"https://doi.org/10.1109/PDP.2013.19","url":null,"abstract":"Multimedia consumption over the Internet is emerging as one of the largest sink of network resources, making scalable and reliable streaming increasingly challenging. To address this challenge, we propose ReStream, an adaptive replication algorithm that relies on replication to achieve reliable and scalable streaming in resource-constrained environments. Our algorithm dynamically adapts replica placement to maximize the number of consumers under latency and bandwidth constraints, while minimizing the number of replicas. In addition, ReStream supports partitioning, i.e., replicas can be located anywhere in the network and do not necessarily form a connected graph. This allows ReStream to yield the same performance in consumption models where consumers tend to be geographically co-located, as well as in consumption models where consumers placement is totally random.","PeriodicalId":202977,"journal":{"name":"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124758942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"The HPC Testbed of the Italian Grid Infrastructure","authors":"R. Alfieri, S. Arezzini, G. Barone, U. Becciani, M. Bencivenni, V. Boccia, D. Bottalico, L. Carracciuolo, D. Cesini, A. Ciampa, A. Costantini, S. Cozzini, R. Pietri, M. Drudi, A. Ghiselli, E. Mazzoni, S. Ottani, A. Venturini, P. Veronesi","doi":"10.1109/PDP.2013.42","DOIUrl":"https://doi.org/10.1109/PDP.2013.42","url":null,"abstract":"Even though the Italian Grid Infrastructure (IGI) is a general purpose distributed platform, in the past it has been used mainly for serial computations. Parallel applications have been typically executed on supercomputer facilities or, in case of \"not high-end\" HPC applications, on local commodity parallel clusters. Nowadays, with the availability of multiple cores processors, Grid computing is becoming very attractive also for parallel applications but some problems exist in supporting of HPC applications on Grid environment. Here we describe the work made to set up a HPC testbed for \"not high-end\" HPC applications, based on IGI Grid technologies, to find solutions to those problems. Participating sites have been selected among the ones running HPC clusters in Grid environment. Each of them contributed with their specific HPC experience and their available resources to the present test, which encompasses an unprecedented large set of applications from different disciplines in the fields of astronomy, astrophysics, chemistry, climatology, material science and oceanography. In addition to computing resources sharing, the main contribution of each participant was the identification of the real requirements of his application also related to the current middleware limitations and then the realization of a test platform enhanced with additional HPC solutions and configurations developed in a tight collaboration between HPC administrators, users and IGI managers. The main work was on computational resources selection, data management and the definition, the deployment and the documentation of the software execution environment. The outcoming results of the testbed represent the basis of the HPC support in the IGI production infrastructure.","PeriodicalId":202977,"journal":{"name":"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"88 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123364055","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A GPU Algorithm Design for Resource Constrained Project Scheduling Problem","authors":"L. Bukata, P. Šůcha","doi":"10.1109/PDP.2013.59","DOIUrl":"https://doi.org/10.1109/PDP.2013.59","url":null,"abstract":"This work proposes a GPU algorithm for a combinatorial problem known in literature as Resource Constrained Project Scheduling Problem. To solve this NP-hard problem, Tabu Search meta-heuristic is selected. All computations are performed on the GPU to minimize required communication bandwidth between the GPU and the CPU. In addition, new evaluation algorithm and effective Tabu List implementation are designed especially for GPUs. Achieved results show that the proposed GPU solution outperforms the equivalent CPU version in both quality of solutions and performance speedup.","PeriodicalId":202977,"journal":{"name":"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124422945","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance Fault-Tolerant Routing Algorithm for NoC-Based Many-Core Systems","authors":"M. Ebrahimi, M. Daneshtalab, J. Plosila","doi":"10.1109/PDP.2013.75","DOIUrl":"https://doi.org/10.1109/PDP.2013.75","url":null,"abstract":"Networks-on-Chip (NoCs) has become a promising approach for the on-chip communication infrastructure of many-core Systems-on-Chip (SoCs). Faults may occur in the NoC both at the router and link level. There are many fault-tolerant approaches presented both in the off-chip and on-chip networks. Some approaches disable some healthy components in order to form a specific shape and others not. Regardless of all varieties, there has always been a common assumption among them. Most of all traditional fault-tolerant methods are based on rerouting packets around a faulty node or region. These approaches affect the performance significantly not only by taking longer paths but also by creating hotspot around a fault. The focus of this paper is to maintain the performance of NoC in the presence of faults. The presented method takes advantage of a fully adaptive routing algorithm using one and two virtual channels along the X and Y dimensions. This method is able to tolerate all cases of one-faulty node without losing the performance of NoC. According to the experimental results, this presented fault-tolerant routing algorithm is able to support up to six faulty nodes in the 8×8 mesh network by up to 98% reliability.","PeriodicalId":202977,"journal":{"name":"2013 21st Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-02-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125683699","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}