{"title":"Generalizing the Utility of GPUs in Large-Scale Heterogeneous Computing Systems","authors":"S. Xiao, Wu-chun Feng","doi":"10.1109/IPDPSW.2012.325","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.325","url":null,"abstract":"Graphics Processing Units (GPUs) have been widely used as accelerators in large-scale heterogeneous computing systems. However, current programming models can only support the utilization of local GPUs. When using non-local GPUs, programmers need to explicitly call API functions for data communication across computing nodes. As such, programming GPUs in large-scale computing systems is more challenging than local GPUs since local and remote GPUs have to be dealt with separately. In this work, we propose a virtual OpenCL (VOCL) framework to support the transparent virtualization of GPUs. This framework, based on the OpenCL programming model, exposes physical GPUs as decoupled virtual resources that can be transparently managed independent of the application execution. To reduce the virtualization overhead, we optimize the GPU memory accesses and kernel launches. We also extend the VOCL framework to support live task migration across physical GPUs to achieve load balance and/or quick system maintenance. Our experiment results indicate that VOCL can greatly simplify the task of programming cluster-based GPUs at a reasonable virtualization cost.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"171 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133310246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Simulation Study on Urban Water Threat Detection in Modern Cyberinfrastructures","authors":"Lizhe Wang, Dan Chen, Ze Deng, R. Ranjan","doi":"10.1109/IPDPSW.2012.127","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.127","url":null,"abstract":"The computation of Contaminant Source Characterization (CSC) is a critical research issue in Water Distribution System (WDS) management. We use a simulation framework to identify optimized locations of sensors that lead to fast detection of contamination sources. The optimization engine is based on a Genetic Algorithm (GA) that interprets trial solutions as individuals. During the optimization process many thousands of these solutions are generated. For a large WDS, the calculation of these solutions are non-trivial and time consuming. Hence, it is a compute intensive application that requires significant compute resources. Furthermore, we strive to generate solutions quickly in order to respond to the urgency of a response. To carry out the calculations we require user-level middleware that can be supporting the workflow of the application and manages the resource assignment in an efficient and fault tolerant fashion. To do so we have prototyped the middleware framework that provides a convenient command line and portal layer of steering applications on Grids. Internally, we utilize a sophisticated workflow engine that provides the ability to access elementary fault tolerant mechanisms for job scheduling. This includes the management of job replicas and the reaction on late return of results. We report the test results of CSC problem solving on a real Grid test bed - the Tera Grid test bed. In addition, we contrast this system architecture with a Hadoop-based implementation that automatically includes fault tolerance. The later activity has been conducted on Future Grid.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"190 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134277050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Conflict Avoidance Scheduling Using Grouping List for Transactional Memory","authors":"Do-Chan Choi, Seung-Hun Kim, W. Ro","doi":"10.1109/IPDPSW.2012.66","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.66","url":null,"abstract":"Conventional Transactional Memory (TM) systems may experience performance degradation in applications with high contention, given the fact that execution of transaction will frequently restart due to conflicts. The restarting of transaction essentially requires rollback that is a wasteful operation. To address this point, we developed a system to reduce the overhead caused by high contention. In this paper, we present a method called Conflict Avoidance Scheduling (CAS), which prevents the conflicts in high contention by use of conflict characteristic. In CAS, threads that execute transactions which have high probability of conflicts are grouped together. Based on the group information, concurrent execution of threads in the same group is restricted. Therefore, threads that may cause conflict are serially executed. We evaluate the performance of the proposed design by comparing it with Log TM-SE. The simulation results show that our system improves the performance by 23% on an average in applications with high contention, as compared with the conventional Log TM-SE.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"21 7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134555427","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Fast Parallel Implementation of Molecular Dynamics with the Morse Potential on a Heterogeneous Petascale Supercomputer","authors":"Qiang Wu, Canqun Yang, Feng Wang, Jingling Xue","doi":"10.1109/IPDPSW.2012.13","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.13","url":null,"abstract":"Molecular Dynamics (MD) simulations have been widely used in the study of macromolecules. To ensure an acceptable level of statistical accuracy relatively large number of particles are needed, which calls for high performance implementations of MD. These days heterogeneous systems, with their high performance potential, low power consumption, and high price-performance ratio, offer a viable alternative for running MD simulations. In this paper we introduce a fast parallel implementation of MD simulation with the Morse potential on Tianhe-1A, a petascale heterogeneous supercomputer. Our code achieves a speedup of 3.6× on one NVIDIA Tesla M2050 GPU (containing 14 Streaming Multiprocessors) compared to a 2.93GHz six-core Intel Xeon X5670 CPU. In addition, our code runs faster on 1024 compute nodes (with two CPUs and one GPU inside a node) than on 4096 GPU-excluded nodes, effectively rendering one GPU more efficient than six six-core CPUs. Our work shows that large-scale MD simulations can benefit enormously from GPU acceleration in petascale supercomputing platforms. Our performance results are achieved by using (1) a patch-cell design to exploit parallelism across the simulation domain, (2) a new GPU kernel developed by taking advantage of Newton's Third Law to reduce redundant force computation on GPUs, (3) two optimization methods including a dynamic load balancing strategy that adjusts the workload, and a communication overlapping method to overlap the communications between CPUs and GPUs.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115594582","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Inference of Huge Trees under Maximum Likelihood","authors":"F. Izquierdo-Carrasco, A. Stamatakis","doi":"10.1109/IPDPSW.2012.309","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.309","url":null,"abstract":"The wide adoption of Next-Generation Sequencing technologies in recent years has generated an avalanche of genetic data, which poses new challenges for large-scale maximum likelihood-based phylogenetic analyses. Improving the scalability of search algorithms and reducing the high memory requirements for computing the likelihood represent major computational challenges in this context. We have introduced methods for solving these key problems and provided respective proof-of-concept implementations. Moreover, we have developed a new tree search strategy that can reduce run times by more than 50% while yielding equally good trees (in the statistical sense). To reduce memory requirements, we explored the applicability of external memory (out-of-core) algorithms as well as a concept that trades memory for additional computations in the likelihood function. The latter concept, only induces a surprisingly small increase in overall execution times. When trading 50% of the required RAM for additional computations, the average execution time increase- because of additional computations-amounts to only 15%. All concepts presented here are sufficiently generic such that they can be applied to all programs that rely on the phylogenetic likelihood function. Thereby, the approaches we have developed will contribute to enable large-scale inferences of whole-genome phylogenies.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"71 6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114294008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Lessons Learned after the Introduction of Parallel and Distributed Computing Concepts into ECE Undergraduate Curricula at UTN-Bahía Blanca Argentina","authors":"Javier Iparraguirre, G. Friedrich, Ricardo Coppo","doi":"10.1109/IPDPSW.2012.163","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.163","url":null,"abstract":"In 2011 we introduced an elective course on Parallel Processing into the ECE undergraduate curricula. UTN Bahía Blanca was one of the first Universities in Argentina that decided to teach OpenCL. During the same year, we also began participation in the NSF/IEEE TCCP 2011 Early Adopters Program. This work summarizes the lessons we learned in our endeavor of teaching parallel and distributed computing concepts. Additionally, it discusses future improvements to our teaching methods and proposes modifications to our initial curricula.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"190 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117336843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A New Task Allocation Algorithm Based on Dynamic Coalition in WSNs","authors":"Chengyu Chen, Wenzhong Guo, Guolong Chen","doi":"10.1109/IPDPSW.2012.153","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.153","url":null,"abstract":"Because nodes in WSNs have limited resources and usually work in a severe dynamic environment without human participation, existing task allocation algorithms in WSNs cannot provide fault-tolerant mechanism. Therefore, a new task allocation algorithm which adopts PSO algorithm and multi-agent technology is proposed by us. The algorithm employs primary/backup copy (PB) technology with backup copy overlapping. The simulation experiment shows the proposed algorithm can effectively improve task guarantee ratio save more energy and prolong the lifetime of network.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124810475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Compiling C/C++ SIMD Extensions for Function and Loop Vectorizaion on Multicore-SIMD Processors","authors":"Xinmin Tian, Hideki Saito, M. Girkar, S. Preis, Sergey Kozhukhov, Aleksei G. Cherkasov, Clark Nelson, Nikolay Panchenko, Robert Y. Geva","doi":"10.1109/IPDPSW.2012.292","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.292","url":null,"abstract":"SIMD vectorization has received significant attention in the past decade as an important method to accelerate scientific applications, media and embedded applications on SIMD architectures such as Intel® SSE, AVX, and IBM* AltiVec. However, most of the focus has been directed at loops, effectively executing their iterations on multiple SIMD lanes concurrently relying upon program hints and compiler analysis. This paper presents a set of new C/C++ high-level vector extensions for SIMD programming, and the Intel® C++ product compiler that is extended to translate these vector extensions and produce optimized SIMD instruction sequences of vectorized functions and loops. For a function, our main idea is to vectorize the entire function for callers instead of just vectorizing loops (if any) inside the function. It poses the challenge of dealing with complicated control-flow in the function body, and matching caller and callee for SIMD vector calls while vectorizing caller functions (or loops) and callee functions. Our compilation methods for automatically compiling vector extensions are described. We present performance results of several non-trivial visual computing, computational, and simulation workloads, utilizing SIMD units through the vector extensions on Intel® Multicore 128-bit SIMD processors, and we show that significant SIMD speedups (3.07x to 4.69x) are achieved over the serial execution.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124828684","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Server-Level Adaptive Data Layout Strategy for Parallel File Systems","authors":"Huaiming Song, Hui Jin, Jun He, Xian-He Sun, R. Thakur","doi":"10.1109/IPDPSW.2012.246","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.246","url":null,"abstract":"Parallel file systems are widely used for providing a high degree of I/O parallelism to mask the gap between I/O and memory speed. However, peak I/O performance is rarely attained due to complex data access patterns of applications. Based on the observation that the I/O performance of small requests is often limited by the request service rate, and the performance of large requests is limited by I/O bandwidth, we take into consideration both factors and propose a server-level adaptive data layout strategy. The proposed strategy adopts different stripe sizes for different file servers according to the data access characteristics on each individual server. We let the file servers that can fully utilize bandwidth hold more data, and the file servers that are limited with request service rate hold less data. As a result, heavy-load servers can offload some data accesses to light-load servers for potential improvement of I/O performance. We present a method to measure access cost for each data block and then utilize an equal-depth histogram approach to distributed data blocks across multiple servers adaptively, so as to balance data accesses on all file servers. Analytical and experimental results demonstrate that the proposed server-level adaptive layout strategy can improve I/O performance by as much as 80.3% and is more appropriate for applications with complex data access patterns.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124869028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hybrid Differential Evolution Using Low-Discrepancy Sequences for Image Segmentation","authors":"A. Nakib, B. Daachi, P. Siarry","doi":"10.1109/IPDPSW.2012.79","DOIUrl":"https://doi.org/10.1109/IPDPSW.2012.79","url":null,"abstract":"The image thresholding problem can be seen as a problem of optimization of an objective function. Many thresholding techniques have been proposed in the literature and the approximation of normalized histogram of an image by a mixture of Gaussian distributions is one of them. Typically, finding the parameters of Gaussian distributions leads to a nonlinear optimization problem, of which solution is computationally expensive and time-consuming. In this paper, an enhanced version of the classical differential evolution algorithm using low-discrepancy sequences and a local search, called LDE, is used to compute these parameters. Experimental results demonstrate the ability of the algorithm in finding optimal thresholds in case of multilevel thresholding.","PeriodicalId":378335,"journal":{"name":"2012 IEEE 26th International Parallel and Distributed Processing Symposium Workshops & PhD Forum","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-05-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123120819","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}