{"title":"A Flexible and Portable Large-Scale DGEMM Library for Linpack on Next-Generation Multi-GPU Systems","authors":"D. Rohr, V. Lindenstruth","doi":"10.1109/PDP.2015.89","DOIUrl":"https://doi.org/10.1109/PDP.2015.89","url":null,"abstract":"In recent years, high performance computing has benefitted greatly from special accelerator cards such as GPUs. Matrix multiplication performed by the BLAS function DGEMM is one of the prime examples where such accelerators excel. DGEMM is the computational hotspot of many tasks, among them the Linpack benchmark. Current GPUs achieve more than 1 TFLOPS real performance in this task. Being connected via PCI Express, one can easily install multiple GPUs in a single compute node. This enables the construction of multi-TFLOPS systems out of off-the-shelf components. At such high performance, it is often complicated to feed the GPUs with sufficient data to run at full performance. In this paper we first analyze the scalability of our DGEMM implementation for multiple fast GPUs. Then we suggest a new scheme optimized for this situation and we present an implementation.","PeriodicalId":285111,"journal":{"name":"2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127106343","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Multi-kernel Auto-Tuning on GPUs: Performance and Energy-Aware Optimization","authors":"J. Guerreiro, A. Ilic, N. Roma, P. Tomás","doi":"10.1109/PDP.2015.44","DOIUrl":"https://doi.org/10.1109/PDP.2015.44","url":null,"abstract":"Prompted by their very high computational capabilities and memory bandwidth, Graphics Processing Units (GPUs) are already widely used to accelerate the execution of many scientific applications. However, programmers are still required to have a very detailed knowledge of the GPU internal architecture when tuning the kernels, in order to improve either performance or energy-efficiency. Moreover, since different GPU devices have different characteristics, moving a kernel to a different GPU typically requires re-tuning the kernel execution in order to efficiently exploit the underlying hardware. The procedure proposed in this work is based on real-time kernel profiling and GPU monitoring, and it automatically tunes parameters from several concurrent kernels to maximize the performance or minimize the energy consumption. Experimental results on NVIDIA GPU devices with up to 4 concurrent kernels show that the proposed solution achieves near optimal configurations. Furthermore, significant energy savings can be achieved by using the proposed energy-efficiency auto-tuning procedure.","PeriodicalId":285111,"journal":{"name":"2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"275 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127552234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Strategies for Parallelizing Swarm Intelligence Algorithms","authors":"F. Cicirelli, G. Folino, Agostino Forestiero, Andrea Giordano, C. Mastroianni, G. Spezzano","doi":"10.1109/PDP.2015.101","DOIUrl":"https://doi.org/10.1109/PDP.2015.101","url":null,"abstract":"Swarm intelligence algorithms, based on multi-agent systems, are often used to solve complex problems that are not affordable through classical centralized/deterministic solutions. In many cases, to enhance the performance of such algorithms, the computation can be distributed to parallel/distributed nodes, in accordance with different strategies. Specifically, parallelization can be achieved either by partitioning the space in which agents operate among the nodes, or by assigning the entire space to each node but distributing input data through a sampling approach. Another choice is whether or not the management of conflicts is needed to prevent possible loss of data consistency. This paper discusses such issues, while referring to two well-known types of swarm intelligence algorithms -- ants and flocking -- and compares the mentioned strategies, evaluating the performance results in terms of speedup.","PeriodicalId":285111,"journal":{"name":"2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129140843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Human and Fire Detection from High Altitude UAV Images","authors":"Themistoklis Giitsidis, E. Karakasis, A. Gasteratos, G. Sirakoulis","doi":"10.1109/PDP.2015.118","DOIUrl":"https://doi.org/10.1109/PDP.2015.118","url":null,"abstract":"Illegal migration as well as wildfires constitute commonplace situations in southern European countries, where the mountainous terrain and thick forests make the surveillance and location of these incidents a tall task. This territory could benefit from Unmanned Aerial Vehicles (UAVs) equipped with optical and thermal sensors in conjunction with sophisticated image processing and computer vision algorithms, in order to detect suspicious activity or prevent the spreading of a fire. Taking into account that the flight height is about two kilometers, human and fire detection algorithms are mainly based on blob detection. For both processes thermal imaging is used in order to improve the accuracy of the algorithms, while in the case of human recognition information like movement patterns as well as shadow size and shape are also considered. For fire detection a blob detector is utilized in conjunction with a color based descriptor, applied to thermal and optical images, respectively. Unlike fire detection, human detection is a more demanding process resulting in a more sophisticated and complex algorithm. The main difficulty of human detection originates from the high flight altitude. In images taken from high altitude where the ground sample distance is not small enough, people appear as small blobs occupying few pixels, leading corresponding research works to be based on blob detectors to detect humans. Their shadows as well as motion detection and object tracking can then be used to determine whether these regions of interest do depict humans. This work follows this motif as well; nevertheless, its main novelty lies in the fact that the human detection process is adapted for high altitude and vertical shooting images, in contrast with the majority of other similar works where lower altitudes and different shooting angles are considered. Additionally, in the interest of making our algorithms as fast as possible in order for them to be used in real time during the UAV flights, parallel image processing with the help of a specialized hardware device based on Field Programmable Gate Array (FPGA) is being worked on.","PeriodicalId":285111,"journal":{"name":"2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117347815","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Towards Parallel Large-Scale Genomic Prediction by Coupling Sparse and Dense Matrix Algebra","authors":"Arne De Coninck, D. Kourounis, F. Verbosio, O. Schenk, B. Baets, S. Maenhout, J. Fostier","doi":"10.1109/PDP.2015.94","DOIUrl":"https://doi.org/10.1109/PDP.2015.94","url":null,"abstract":"Genomic prediction for plant breeding requires taking into account environmental effects and variations of genetic effects across environments. The latter can be modelled by estimating the effect of each genetic marker in every possible environmental condition, which leads to a huge number of effects to be estimated. Nonetheless, the information about these effects is only sparsely present, due to the fact that plants are only tested in a limited number of environmental conditions. In contrast, the genotypes of the plants are a dense source of information and thus the estimation of both types of effects in one single step would require both dense and sparse matrix formalisms. This paper presents a way to efficiently apply a high performance computing infrastructure for dealing with large-scale genomic prediction settings, relying on the coupling of dense and sparse matrix algebra.","PeriodicalId":285111,"journal":{"name":"2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132790338","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Formal Model of Policy Reconciliation","authors":"C. Basile, A. Lioy, Christian Pitscheider, Shilong Zhao","doi":"10.1109/PDP.2015.42","DOIUrl":"https://doi.org/10.1109/PDP.2015.42","url":null,"abstract":"This paper proposes a novel approach to perform the reconciliation of security policies by means of user-defined reconciliation strategies. The proposed policy reconciliation model allows several degrees of freedom when specifying reconciliation strategies, which can be based not only on rule actions, like most of the works in literature, but also on other rule data (e.g., the conditions) and other external data (e.g., rule priorities, policy priorities). Additionally, it can be applied to reconcile policies at runtime and off-line, that is, it allows the generation of a reconciled policy. Moreover, the reconciliation process generates a detailed report on all the decisions taken. Given its expressiveness, the approach can also be applied to simplify the policy specification process. The model has been validated against a practical example, the definition of the application layer filtering policy in a corporate scenario, and its performance has been tested with synthetic policies. Both validation and performance analysis gave encouraging results.","PeriodicalId":285111,"journal":{"name":"2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131162693","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Embedded Hypervisor Xvisor: A Comparative Analysis","authors":"Anup Patel, M. Daftedar, M. Shalan, M. El-Kharashi","doi":"10.1109/PDP.2015.108","DOIUrl":"https://doi.org/10.1109/PDP.2015.108","url":null,"abstract":"Virtualization technology has shown immense popularity within embedded systems due to its direct relationship with cost reduction, better resource utilization, and higher performance measures. Efficient hypervisors are required to achieve such high performance measures in virtualized environments, while taking into consideration the low memory footprints as well as the stringent timing constraints of embedded systems. Although there are a number of open-source hypervisors available such as Xen, Linux KVM and OKL4 Microvisor, this is the first paper to present the open-source embedded hypervisor Extensible Versatile hypervisor (Xvisor) and compare it against two commonly used hypervisors, KVM and Xen, in terms of comparison factors that affect the whole system performance. Experimental results on ARM architecture prove Xvisor's lower CPU overhead, higher memory bandwidth, lower lock synchronization latency and lower virtual timer interrupt overhead and thus overall enhanced virtualized embedded system performance.","PeriodicalId":285111,"journal":{"name":"2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133609671","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Exploiting Local Clouds in the Internet of Everything Environment","authors":"Francisco Javier Nieto de Santos, Sergio García Villalonga","doi":"10.1109/PDP.2015.117","DOIUrl":"https://doi.org/10.1109/PDP.2015.117","url":null,"abstract":"The Internet of Everything is opening new opportunities and challenges which will be faced during the following years. Huge amounts of data will be generated and consumed, so Internet of Things frameworks will need to provide new capabilities related to Big Data analysis, scalability and performance. We believe the formation of local clouds of devices, close to the location where data is created and consumed, is a good solution to overcome these issues, which may impact security as well. The combination of local and remote resources together with the appropriate allocation algorithms for their management will provide the means to enable the new required features, going beyond the current state of the art and still leaving enough evolution capacity for future scenarios.","PeriodicalId":285111,"journal":{"name":"2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125136753","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Implementing MVC Decoding on Homogeneous NoCs: Circuit Switching or Wormhole Switching","authors":"Ning Ma, Z. Zou, Zhonghai Lu, Lirong Zheng","doi":"10.1109/PDP.2015.48","DOIUrl":"https://doi.org/10.1109/PDP.2015.48","url":null,"abstract":"To implement multiview video decoding on network on-chip (NoC) based homogeneous multicore architectures, the selection of switching techniques for routers is one of the most important aspects for design space exploration. Circuit switching and wormhole switching are the two most feasible switching techniques for on-chip networks. To choose a suitable switching technique, we compare circuit switching and wormhole switching in terms of whole-system decoding speed, link utilization and delay when implementing eight-view QVGA video decoding on 4 × 4 NoCs at 30 fps. The required link bandwidths are both around 800 Mbps, with similar network utilization and delay. We conclude that, to implement multiview video decoding on homogeneous NoCs, circuit switching is more suitable considering the similar performance and lower cost compared with wormhole switching.","PeriodicalId":285111,"journal":{"name":"2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125214928","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Implementation of Fuzzified Pattern Matching Algorithm on GPU","authors":"S. Soroushnia, M. Daneshtalab, T. Pahikkala, J. Plosila","doi":"10.1109/PDP.2015.75","DOIUrl":"https://doi.org/10.1109/PDP.2015.75","url":null,"abstract":"Approximate pattern discovery is one of the fundamental and challenging problems in computer science. Fast and high performance algorithms are highly demanded in many applications in bioinformatics and computational molecular biology, the domains that benefit most directly from any enhancement of pattern matching theoretical knowledge and solutions. This paper proposes an efficient GPU implementation of a fuzzified Aho-Corasick algorithm using the Levenshtein method and the N-gram technique as a solution for the approximate pattern matching problem.","PeriodicalId":285111,"journal":{"name":"2015 23rd Euromicro International Conference on Parallel, Distributed, and Network-Based Processing","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-03-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128023708","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}