{"title":"Toward Automatic Optimized Code Generation for Multiprecision Modular Exponentiation on a GPU","authors":"Niall Emmart, C. Weems","doi":"10.1109/IPDPSW.2013.149","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.149","url":null,"abstract":"Multiprecision modular exponentiation has a variety of uses, including cryptography, prime testing, and computational number theory. It is also a very costly operation to compute. GPU parallelism can be used to accelerate these computations, but to use the GPU efficiently, a problem must involve a significant number of simultaneous exponentiation operations. Handling a large number of TLS/SSL encrypted sessions in a data center is a significant problem that fits this profile. We have developed a framework that enables generation of highly efficient NVIDIA PTX implementations of exponentiation operations for different GPU architectures and problem instances. One of the challenges in generating such code is that PTX is not a true assembly language, but is instead a virtual instruction set that is compiled and optimized in different ways for different generations of GPU hardware. Thus, the same PTX code runs with different levels of efficiency on different machines. And as the precision of the exponentiation values changes, each architecture has its own break-even points where a different algorithm or parallelization strategy must be employed. Making the code efficient for a given problem instance and architecture thus requires searching a multidimensional space of algorithms and configurations: generating thousands of lines of carefully constructed PTX code for each combination, executing it, validating the numerical result, and evaluating its actual performance. Our framework automates much of this process, and produces exponentiation code that is up to six times faster than the best known hand-coded implementations. More importantly, the framework enables users to relatively quickly find the best configuration for each new GPU architecture. Our framework is also the basis for the eventual creation of a multiprecision matrix arithmetic package for GPU cluster systems that will be portable across multiple generations of GPU.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114682938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Subdomain Mapping Approach to Enhance the Coupling in Earth System Modeling","authors":"Yingsheng Ji, Guangwen Yang, Li Liu, Shu Wang","doi":"10.1109/IPDPSW.2013.62","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.62","url":null,"abstract":"To lower the technical barrier for different scientific communities to participate in Earth System Modeling, a coupler is widely used to link two or more climate simulation applications called models. With the advent of advanced models, the larger data volumes for transfer and transformation incur significant performance overhead in the coupler. However, the current independent modular design cannot obtain optimal coupling performance. In this paper, we propose a method called the subdomain mapping approach to improve coupling. Our method can merge all the communications during one coupling execution and thus reduce avoidable cost. The evaluation results show that the subdomain mapping scheme achieves considerable speedups, ranging from 1.14- to 4.81-fold, on a cluster interconnected via high-speed InfiniBand. The results also show that our approach can effectively enhance the coupler's performance and scalability in most cases.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"EC-1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126549008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An On-chip Heterogeneous Implementation of a General Sparse Linear Solver","authors":"Arash Sadrieh, Stefano Charissis, A. Hill","doi":"10.1109/IPDPSW.2013.51","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.51","url":null,"abstract":"Inter-device communication is a common limitation of GPGPU computing methods. The on-chip heterogeneous architecture of a recent class of accelerated processing units (APUs), which combine programmable CPU and GPU cores on the same die, presents an opportunity to address this problem. Here we describe an APU-based heterogeneous implementation of the Jacobi-preconditioned conjugate gradient method and identify a set of optimal configurations based on examination of standard matrices. By leveraging the low-latency memory transactions of the APU and exploiting CPU/GPU cohabitation for concurrent vector operations, performance comparable to that of a high-end GPU running CUSP is achieved. Our results show that on-chip heterogeneous architectures can be attractively cost-effective, and can even deliver better performance for applications with a low number of linear solver iterations and significant device-to-device data transfer. Accordingly, the APU architecture and associated GPAPU methods have significant potential as a low-cost, energy-efficient alternative to parallel HPC architectures.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122270554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transparent Optimization of Parallel File System I/O via Standard System Tool Enhancement","authors":"Paul Z. Kolano","doi":"10.1109/IPDPSW.2013.192","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.192","url":null,"abstract":"Standard system tools employed by users on a daily basis do not take full advantage of parallel file system I/O bandwidth and do not understand associated idiosyncrasies such as Lustre striping. This can lead to non-optimal utilization of both the user's time and system resources. This paper describes a set of modifications made to existing tools that increase parallelism and automatically handle striping. These modifications result in significant performance gains in a transparent manner, with maximum speedups of 27×, 15×, and 31× for parallelized cp, tar creation, and tar extraction, respectively.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131926182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance GPU Accelerated Local Optimization in TSP","authors":"K. Rocki, R. Suda","doi":"10.1109/IPDPSW.2013.227","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.227","url":null,"abstract":"This paper presents a high performance GPU accelerated implementation of the 2-opt local search algorithm for the Traveling Salesman Problem (TSP). GPU usage significantly decreases the execution time needed for tour optimization; however, it also requires a complicated and well-tuned implementation. As the problem size grows, the time spent on local optimization comparing graph edges grows significantly. According to our results based on instances from the TSPLIB library, the time needed to perform a simple local search operation can be decreased approximately 5 to 45 times compared to a corresponding parallel CPU code implementation using 6 cores. The code has been implemented in both OpenCL and CUDA and tested on AMD and NVIDIA devices. The experimental studies show that the optimization algorithm using the GPU local search converges up to 300 times faster on average than the sequential CPU version, depending on the problem size. The main contributions of this paper are a problem division scheme exploiting data locality, which allows arbitrarily large problem instances to be solved on the GPU, and the parallel implementation of the algorithm itself.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"219 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134541628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Throughput Parallel Implementation of Aho-Corasick Algorithm on a GPU","authors":"Nhat-Phuong Tran, Myungho Lee, Sugwon Hong, Jaeyoung Choi","doi":"10.1109/IPDPSW.2013.116","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.116","url":null,"abstract":"Pattern matching is an important operation in various applications such as computer and network security, bioinformatics, and image processing, among many others. The Aho-Corasick (AC) algorithm is a multiple-pattern matching algorithm commonly used for such applications. In order to meet the highly demanding performance requirements imposed on these applications, achieving high performance for the AC algorithm is crucial. In this paper, we present a high performance parallel implementation of the AC algorithm on a Graphics Processing Unit (GPU) which efficiently utilizes the high degree of on-chip parallelism and the memory hierarchy of the GPU so that the aggregate performance (or throughput) of the GPU can be maximized. For this purpose, our approach carefully places and caches the input text data and the reference pattern data used for pattern matching in the on-chip shared memories and the texture caches of the GPU. Furthermore, it efficiently schedules the off-chip global memory loads and the shared memory stores in order to minimize the overhead of loading the input data into the shared memories and to minimize shared memory bank conflicts. The proposed approach significantly cuts the effective memory access latencies and leads to impressive performance improvements. Experimental results on an Nvidia GeForce GTX 285 GPU show that our approach delivers up to 127 Gbps of throughput and a speedup of up to 222 times over a serial version running on a 2.2 GHz Intel Core2 Duo processor.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131561222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware Supported Adaptive Data Collection for Networks on Chip","authors":"Jan Heisswolf, A. Weichslgartner, A. Zaib, Ralf König, Thomas Wild, A. Herkersdorf, J. Teich, J. Becker","doi":"10.1109/IPDPSW.2013.124","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.124","url":null,"abstract":"Managing future many-core architectures with hundreds of cores, running multiple applications in parallel, is very challenging. One of the major reasons is the communication overhead required to handle such a large system. Distributed management is proposed to reduce this overhead: the architecture is divided into regions which are managed separately. The instance managing a region and the applications running within it need to collect data from time to time for various reasons, e.g., to make proper mapping decisions, to synchronize tasks, or to aggregate computation results. In this work, we propose and investigate different strategies for adaptive data collection in meshed Networks on Chip. The mechanisms can be used to collect data within regions whose size and position are defined at run-time. The mechanisms are investigated with respect to delay, NoC utilization, and implementation cost. The results show that the choice of mechanism depends on the requirements. Synthesis results compare area overhead, timing impact, and energy consumption.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132779746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"legaSCi: Legacy SystemC Model Integration into Parallel SystemC Simulators","authors":"Christoph Schumacher, Jan Weinstock, R. Leupers, G. Ascheid, L. Tosoratto, A. Lonardo, D. Petras, Thorsten Grötker","doi":"10.1109/IPDPSW.2013.34","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.34","url":null,"abstract":"Virtual prototyping of parallel and embedded systems increases insight into existing computer systems. It further allows properties of new systems to be explored already during their specification phase. Virtual prototypes of such systems benefit from parallel simulation techniques due to the increased simulation speed. One common problem full system simulator implementers face is the revision and integration of legacy models coded without thread-safety and deterministic behavior in mind. To lessen this burden, this paper presents a methodology to integrate unmodified SystemC legacy models into parallel SystemC simulators. Using the proposed technique, the embedded platform simulator of the EU FP7 project EURETILE, which inherited a number of legacy models from its predecessor project SHAPES, has been transformed into a parallel simulation platform, demonstrating speed-ups of up to 3.36 on four simulation host cores.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"506 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133053677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comprehensive Analysis of XOR-Based Erasure Codes Tolerating 3 or More Concurrent Failures","authors":"P. Subedi, Xubin He","doi":"10.1109/IPDPSW.2013.155","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.155","url":null,"abstract":"In large-scale database storage systems, RAID systems have gained popularity due to their capability to tolerate multiple failures. As data loss is not an option in such systems, recent studies focus on erasure codes for RAID systems that can tolerate three or more concurrent failures. This paper surveys recent XOR-based erasure codes and presents a comprehensive analysis and comparison among these codes. Moreover, some erasure codes that focus on clustered erasures are also discussed. The paper concludes with open research issues.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"26 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133482628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cooperative MIMO Paradigms for Cognitive Radio Networks","authors":"Wei Chen, Liang Hong","doi":"10.1109/IPDPSW.2013.9","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.9","url":null,"abstract":"This paper investigates the benefits that cooperation brings to cognitive radio networks. We focus on cooperative Multiple Input Multiple Output (MIMO) technology, where multiple distributed secondary users cooperate on data transmission and reception. Energy efficient cooperative MIMO paradigms are proposed to maximize the diversity gain and significantly improve the performance of both overlay and underlay systems. In the proposed overlay system, the secondary users can assist (relay) the primary transmissions even when they are far away from the primary users. In the proposed underlay system, the secondary users can share the primary users' frequency resources without any knowledge of the primary users' signals while meeting the strict interference constraint that the transmitted spectral density of the secondary users falls below the noise floor at the primary receivers. Numerical and experimental results are provided in order to discuss the advantages and limits of the proposed paradigms.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133556677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}