{"title":"Resource Centered Computing Delivering High Parallel Performance","authors":"J. Gustedt, S. Vialle, P. Mercier","doi":"10.1109/IPDPSW.2014.14","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.14","url":null,"abstract":"Modern parallel programming requires a combination of different paradigms, expertise and tuning, that correspond to the different levels in today's hierarchical architectures. To cope with the inherent difficulty, ORWL (ordered read-write locks) presents a new paradigm and toolbox centered around local or remote resources, such as data, processors or accelerators. ORWL programmers describe their computation in terms of access to these resources during critical sections. Exclusive or shared access to the resources is granted through FIFOs and with read-write semantic. ORWL partially replaces a classical runtime and offers a new API for resource centric parallel programming. We successfully ran an ORWL benchmark application on different parallel architectures (a multicore CPU cluster, a NUMA machine, a CPU+GPU cluster). When processing large data we achieved scalability and performance similar to a reference code built on top of MPI+OpenMP+CUDA. The integration of optimized kernels of scientific computing libraries (ATLAS and cuBLAS) has been almost effortless, and we were able to increase performance using both CPU and GPU cores on our hybrid hierarchical cluster simultaneously. 
We aim to make ORWL a new easy-to-use and efficient programming model and toolbox for parallel developers.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124624966","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Acceleration of GPU-Based Ultrasound Simulation via Data Compression","authors":"Andrew A. Haigh, Eric C. McCreath","doi":"10.1109/IPDPSW.2014.140","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.140","url":null,"abstract":"The realistic simulation of ultrasound wave propagation is computationally intensive. The large size of the grid and low degree of reuse of data means that it places a great demand on memory bandwidth. Graphics Processing Units (GPUs) have attracted attention for performing scientific calculations due to their potential for efficiently performing large numbers of floating point computations. However, many applications may be limited by memory bandwidth, especially for data sets whose size is larger than that of the GPU platform. This problem is only partially mitigated by applying the standard technique of breaking the grid into regions and overlapping the computation of one region with the host-device memory transfer of another. In this paper, we implement a memory-bound GPU-based ultrasound simulation and evaluate the use of a technique for improving performance by compressing the data into a fixed-point representation that reduces the time required for inter-host-device transfers. We demonstrate a speedup of 1.5 times on a simulation where the data is broken into regions that must be copied back and forth between the CPU and GPU. We develop a model that can be used to determine the amount of temporal blocking required to achieve near optimal performance, without extensive experimentation. 
This technique may also be applied to GPU-based scientific simulations in other domains such as computational fluid dynamics and electromagnetic wave simulation.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"61 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126672590","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel Heuristics for Scalable Community Detection","authors":"Hao Lu, M. Halappanavar, A. Kalyanaraman, Sutanay Choudhury","doi":"10.1109/IPDPSW.2014.155","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.155","url":null,"abstract":"Community detection has become a fundamental operation in numerous graph-theoretic applications. It is used to reveal natural divisions that exist within real world networks without imposing prior size or cardinality constraints on the set of communities. Despite its potential for application, there is only limited support for community detection on large-scale parallel computers, largely owing to the irregular and inherently sequential nature of the underlying heuristics. In this paper, we present parallelization heuristics for fast community detection using the Louvain method as the serial template. The Louvain method is an iterative heuristic for modularity optimization. Originally developed by Blondel et al. in 2008, the method has become increasingly popular owing to its ability to detect high modularity community partitions in a fast and memory-efficient manner. However, the method is also inherently sequential, thereby limiting its scalability. Here, we observe certain key properties of this method that present challenges for its parallelization, and consequently propose heuristics that are designed to break the sequential barrier. For evaluation purposes, we implemented our heuristics using OpenMP multithreading, and tested them over real world graphs derived from multiple application domains (e.g., internet, citation, biological). Compared to the serial Louvain implementation, our parallel implementation is able to produce community outputs with a higher modularity for most of the inputs tested, in comparable number of iterations, while providing real speedups of up to 8× using 32 threads. 
In addition, our parallel implementation was able to exhibit weak scaling properties on up to 32 threads.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"249 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129034734","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Influence of Magnetic Fields and X-Radiation on Ring Oscillators in FPGAs","authors":"Michael Raitza, Markus Vogt, C. Hochberger, Thilo Pionteck","doi":"10.1109/IPDPSW.2014.26","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.26","url":null,"abstract":"Cryptographic functions are of increasing importance for all kinds of hardware devices. Their strength against attackers not only relies on the particular cryptographic algorithm but also on the quality of the underlying random number generator. Several techniques have been proposed for implementing true random number generators in digital circuits, yet their immunity against ionising radiation and strong magnetic fields has often not been evaluated. In particular FPGAs seem to be prone to such kinds of attacks, as ionising radiation and magnetic fields may not only influence logic gates but also the configuration memory. In this paper we investigate the influence of X-rays and magnetic fields on three different types of ring oscillators. We conduct experiments with a constant X-ray beam generated by a tungsten radiation source and strong static magnetic fields up to 14 T. We show that both magnetic fields and X-radiation do not have any influence on the amount of entropy generated by the ring oscillators, hence these implementations can be considered safe against such attacks. 
The random number generators are implemented on Altera Cyclone IV, Lattice LFE3, and Xilinx Spartan 6 FPGAs.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"392 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123364700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Bag-of-Task Scheduling on Power-Aware Clusters Using a DVFS-Based Mechanism","authors":"G. Terzopoulos, H. Karatza","doi":"10.1109/IPDPSW.2014.95","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.95","url":null,"abstract":"Energy reduction is very important nowadays. A large percentage of the workload submitted to large-scale systems is bag-of-tasks (BoT) applications. Each BoT is a collection of independent tasks that do not communicate with each other. They are used in astronomy, Monte Carlo simulations, data mining, fractal calculations, image processing and massive searches. Due to their importance, BoT scheduling is extensively studied regarding performance. In this paper we view BoT scheduling from an energy efficiency perspective. In order to save energy, we apply a Dynamic Voltage/Frequency Scaling (DVFS) mechanism to a heterogeneous cluster environment where BoTs are submitted. A cluster environment is selected due to the fact that clusters are often used as underlying basic components in grids and clouds. In order for our simulation experiments to be more realistic regarding the workload applied in the system, we also consider high-priority tasks. 
Extensive simulation experiments show that by applying the proposed DVFS mechanism when BoTs are executed, we can achieve energy savings up to 13% without affecting the execution of high-priority tasks.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114339921","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Parallel and Distributed Computing across the Computer Science Curriculum","authors":"D. J. John, Stan J. Thomas","doi":"10.1109/IPDPSW.2014.121","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.121","url":null,"abstract":"Two recent curriculum studies, the ACM/IEEE Curricula 2013 Report and the NSF/IEEE-TCPP Curriculum Initiative on Parallel and Distributed Computing, argue that every undergraduate computer science program should include topics in parallel and distributed computing (PDC). Although not within the scope of these reports, there is also a need for students in computing related general education courses to be aware of the role that parallel and distributed computing technologies play in the computing landscape. One approach to integrating these topics into existing curricula is to spread them across several courses. However, this approach requires development of multiple instructional modules targeted to introduce PDC concepts at specific points in the curriculum. Such modules need to mesh with the goals of the courses for which they are designed in such a way that minimal material has to be removed from existing topics. At the same time the modules should provide students with an understanding of and experience employing fundamental PDC concepts. 
In this paper we report on our experience developing and deploying such modules.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128155409","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Over-clocking of Linear Projection Designs through Device Specific Optimisations","authors":"R. Duarte, C. Bouganis","doi":"10.1109/IPDPSW.2014.25","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.25","url":null,"abstract":"Frequently, applications such as image and video processing rely on implementations of the Linear Projection algorithm with high throughput and low latency requirements. This work presents a framework to optimise Linear Projection designs that outperform typical design implementations via a pre-characterisation of over-clocked arithmetic units. It is well known that the delay models used by synthesis tools are generic and tuned for the worst possible performance of a given fabrication process. Hence, they impose a heavy penalty on the possible maximum performance offered by the fabrication process. The proposed optimisation framework focuses on the optimisation of the generic multipliers, as they are the arithmetic operators with the most critical paths in the data path of a linear projection design, by performing a performance characterisation step on the target device. Experiments demonstrate that the proposed framework is able to generate Linear Projection designs that achieve higher throughput (up to 1.85 times) while producing fewer errors than typical implementation methodologies.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121847606","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Training Large Scale Deep Neural Networks on the Intel Xeon Phi Many-Core Coprocessor","authors":"Lei Jin, Zhaokang Wang, Rong Gu, C. Yuan, Y. Huang","doi":"10.1109/IPDPSW.2014.194","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.194","url":null,"abstract":"As a new area of machine learning research, the deep learning algorithm has attracted a lot of attention from the research community. It may bring human beings to a higher cognitive level of data. Its unsupervised pre-training step allows us to find high-dimensional representations or abstract features which work much better than the principal component analysis (PCA) method. However, it will face problems when being applied to deal with large scale data due to its intensive computation from many levels of training process against large scale data. The sequential deep learning algorithms usually can not finish the computation in an acceptable time. In this paper, we propose a many-core algorithm which is based on a parallel method and is used in the Intel Xeon Phi many-core systems to speed up the unsupervised training process of Sparse Autoencoder and Restricted Boltzmann Machine (RBM). Using the sequential training algorithm as a baseline to compare, we adopted several optimization methods to parallelize the algorithm. The experimental results show that our fully-optimized algorithm gains more than 300-fold speedup on parallelized Sparse Autoencoder compared with the original sequential algorithm on the Intel Xeon Phi coprocessor. Also, we ran the fully-optimized code on both the Intel Xeon Phi coprocessor and an expensive Intel Xeon CPU. Our method on the Intel Xeon Phi coprocessor is 7 to 10 times faster than the Intel Xeon CPU for this application. In addition to this, we compared our fully-optimized code on the Intel Xeon Phi with a Matlab code running on single Intel Xeon CPU. Our method on the Intel Xeon Phi runs 16 times faster than the Matlab implementation. 
The result also suggests that the Intel Xeon Phi can offer an efficient but more general-purpose way to parallelize the deep learning algorithm compared to a GPU. It also achieves faster speed with better parallelism than the Intel Xeon CPU.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122007924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Framework for Customizing Virtual 3-D Reconfigurable Platforms at Run-Time","authors":"K. Siozios, D. Soudris, M. Hübner","doi":"10.1109/IPDPSW.2014.201","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.201","url":null,"abstract":"Existing application domains exhibit variations in terms of complexity, performance and power consumption, whereas their efficient implementation onto general-purpose reconfigurable platforms is not always a viable solution. Towards this goal, throughout this paper, we introduce a software-supported framework for supporting efficient customization of these platforms. Rather than similar approaches, where the phase (design-time), our solution provides post-fabrication customization of architectural parameters based on application's inherent requirements through a virtualization layer. For evaluation purposes, the introduced framework was applied to 3-D reconfigurable architectures. Experimental results with applications from various domains prove the effectiveness of our solution, as we achieve average delay and power reduction by 1.43X and 1.15X, respectively, as compared to the existing way for application implementation.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132479099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Trust-Based Security for the Spanning Tree Protocol","authors":"Yingxu Lai, Qiuyue Pan, Zenghui Liu, Yinong Chen, Zhizheng Zhou","doi":"10.1109/IPDPSW.2014.150","DOIUrl":"https://doi.org/10.1109/IPDPSW.2014.150","url":null,"abstract":"Attacks executed on the Spanning Tree Protocol (STP) expose the weakness of link layer protocols and put the higher layers in jeopardy. Although the problems have been studied for many years and various solutions have been proposed, many security issues remain. To enhance the security and credibility of the layer-2 network, we propose a trust-based spanning tree protocol aiming at achieving a higher credibility of LAN switches with a simple and lightweight authentication mechanism. If correctly implemented in each trusted switch, the authentication of trust-based STP can guarantee the credibility of topology information that is announced to other switches in the LAN. To verify the enforcement of the trusted protocol, we present a new credible evaluation method of the STP using a specification-based state model. We implement a prototype of trust-based STP to investigate its practicality. Experiments show that the trusted protocol can achieve security goals and effectively avoid STP attacks with a lower computation overhead and good convergence performance.","PeriodicalId":153864,"journal":{"name":"2014 IEEE International Parallel & Distributed Processing Symposium Workshops","volume":"62 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131759585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}