{"title":"Toward Automatic Optimized Code Generation for Multiprecision Modular Exponentiation on a GPU","authors":"Niall Emmart, C. Weems","doi":"10.1109/IPDPSW.2013.149","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.149","url":null,"abstract":"Multiprecision modular exponentiation has a variety of uses, including cryptography, prime testing, and computational number theory. It is also a very costly operation to compute. GPU parallelism can be used to accelerate these computations, but to use the GPU efficiently, a problem must involve a significant number of simultaneous exponentiation operations. Handling a large number of TLS/SSL encrypted sessions in a data center is a significant problem that fits this profile. We have developed a framework that enables generation of highly efficient NVIDIA PTX implementations of exponentiation operations for different GPU architectures and problem instances. One of the challenges in generating such code is that PTX is not a true assembly language, but is instead a virtual instruction set that is compiled and optimized in different ways for different generations of GPU hardware. Thus, the same PTX code runs with different levels of efficiency on different machines. And as the precision of the exponentiation values changes, each architecture has its own break-even points where a different algorithm or parallelization strategy must be employed. Making the code efficient for a given problem instance and architecture thus requires searching a multidimensional space of algorithms and configurations: generating thousands of lines of carefully constructed PTX code for each combination, executing it, validating the numerical result, and evaluating its actual performance. Our framework automates much of this process, and produces exponentiation code that is up to six times faster than the best known hand-coded implementations. More importantly, the framework enables users to relatively quickly find the best configuration for each new GPU architecture. Our framework is also the basis for the eventual creation of a multiprecision matrix arithmetic package for GPU cluster systems that will be portable across multiple generations of GPU.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114682938","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Subdomain Mapping Approach to Enhance the Coupling in Earth System Modeling","authors":"Yingsheng Ji, Guangwen Yang, Li Liu, Shu Wang","doi":"10.1109/IPDPSW.2013.62","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.62","url":null,"abstract":"To lower the technical barrier for different scientific communities to participate in Earth System Modeling, a coupler is widely used to link two or more climate simulation applications called models. With the advent of advanced models, the larger data volumes for transfer and transformation incur significant performance overhead in the coupler. However, the current independent modular design cannot obtain optimal coupling performance. In this paper, we propose a method called the subdomain mapping approach to improve coupling. Our method can merge all the communications during one coupling execution and thus reduce avoidable cost. The evaluation results show that the subdomain mapping scheme achieves considerable speedups, ranging from 1.14- to 4.81-fold, on a cluster interconnected via high-speed InfiniBand. The results also show that our approach can effectively enhance the coupler's performance and scalability in most cases.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"EC-1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126549008","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"An On-chip Heterogeneous Implementation of a General Sparse Linear Solver","authors":"Arash Sadrieh, Stefano Charissis, A. Hill","doi":"10.1109/IPDPSW.2013.51","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.51","url":null,"abstract":"Inter-device communication is a common limitation of GPGPU computing methods. The on-chip heterogeneous architecture of a recent class of accelerated processing units (APUs), which combine programmable CPU and GPU cores on the same die, presents an opportunity to address this problem. Here we describe an APU-based heterogeneous implementation of the Jacobi-preconditioned conjugate gradient method and identify a set of optimal configurations based on examination of standard matrices. By leveraging the low-latency memory transactions of the APU and exploiting CPU/GPU cohabitation for concurrent vector operations, performance comparable to that of a high-end GPU running CUSP is achieved. Our results show that on-chip heterogeneous architectures can be attractively cost-effective, and can even deliver better performance for applications with a low number of linear solver iterations and significant device-to-device data transfer. Accordingly, the APU architecture and associated GPAPU methods have significant potential as a low-cost, energy-efficient alternative to parallel HPC architectures.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122270554","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Transparent Optimization of Parallel File System I/O via Standard System Tool Enhancement","authors":"Paul Z. Kolano","doi":"10.1109/IPDPSW.2013.192","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.192","url":null,"abstract":"Standard system tools employed by users on a daily basis do not take full advantage of parallel file system I/O bandwidth and do not understand associated idiosyncrasies such as Lustre striping. This can lead to non-optimal utilization of both the user's time and system resources. This paper describes a set of modifications made to existing tools that increase parallelism and automatically handle striping. These modifications result in significant performance gains in a transparent manner, with maximum speedups of 27×, 15×, and 31× for parallelized cp, tar creation, and tar extraction, respectively.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"109 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131926182","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Performance GPU Accelerated Local Optimization in TSP","authors":"K. Rocki, R. Suda","doi":"10.1109/IPDPSW.2013.227","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.227","url":null,"abstract":"This paper presents a high performance GPU accelerated implementation of the 2-opt local search algorithm for the Traveling Salesman Problem (TSP). GPU usage significantly decreases the execution time needed for tour optimization; however, it also requires a complicated and well-tuned implementation. As the problem size grows, the time spent on local optimization comparing graph edges grows significantly. According to our results based on instances from the TSPLIB library, the time needed to perform a simple local search operation can be decreased approximately 5 to 45 times compared to a corresponding parallel CPU code implementation using 6 cores. The code has been implemented in both OpenCL and CUDA and tested on AMD and NVIDIA devices. The experimental studies show that the optimization algorithm using the GPU local search converges up to 300 times faster on average than the sequential CPU version, depending on the problem size. The main contributions of this paper are a problem division scheme exploiting data locality, which allows arbitrarily large problem instances to be solved on the GPU, and the parallel implementation of the algorithm itself.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"219 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134541628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"High Throughput Parallel Implementation of Aho-Corasick Algorithm on a GPU","authors":"Nhat-Phuong Tran, Myungho Lee, Sugwon Hong, Jaeyoung Choi","doi":"10.1109/IPDPSW.2013.116","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.116","url":null,"abstract":"Pattern matching is an important operation in various applications such as computer and network security, bioinformatics, and image processing, among many others. The Aho-Corasick (AC) algorithm is a multiple-pattern matching algorithm commonly used for such applications. In order to meet the highly demanding performance requirements imposed on these applications, achieving high performance for the AC algorithm is crucial. In this paper, we present a high performance parallel implementation of the AC algorithm on a Graphics Processing Unit (GPU) which efficiently utilizes the high degree of on-chip parallelism and the memory hierarchy of the GPU so that the aggregate performance (or throughput) of the GPU can be maximized. For this purpose, our approach carefully places and caches the input text data and the reference pattern data used for pattern matching in the on-chip shared memories and the texture caches of the GPU. Furthermore, it efficiently schedules the off-chip global memory loads and the shared memory stores in order to minimize the overhead of loading the input data into the shared memories and to minimize shared memory bank conflicts. The proposed approach significantly cuts the effective memory access latencies and leads to impressive performance improvements. Experimental results on an Nvidia GeForce GTX 285 GPU show that our approach delivers up to 127 Gbps of throughput and a speedup of up to 222 times over a serial version running on a 2.2 GHz Intel Core2 Duo processor.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131561222","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Hardware Supported Adaptive Data Collection for Networks on Chip","authors":"Jan Heisswolf, A. Weichslgartner, A. Zaib, Ralf König, Thomas Wild, A. Herkersdorf, J. Teich, J. Becker","doi":"10.1109/IPDPSW.2013.124","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.124","url":null,"abstract":"Managing future many-core architectures with hundreds of cores, running multiple applications in parallel, is very challenging. One of the major reasons is the communication overhead required to handle such a large system. Distributed management is proposed to reduce this overhead: the architecture is divided into regions which are managed separately. The instance managing a region and the applications running within it need to collect data from time to time for various reasons, e.g., to make proper mapping decisions, to synchronize tasks, or to aggregate computation results. In this work, we propose and investigate different strategies for adaptive data collection in meshed Networks on Chip. The mechanisms can be used to collect data within regions whose size and position are defined at run-time. The mechanisms are investigated with respect to delay, NoC utilization, and implementation cost. The results show that the choice of mechanism depends on the requirements. Synthesis results compare area overhead, timing impact, and energy consumption.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"67 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132779746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"legaSCi: Legacy SystemC Model Integration into Parallel SystemC Simulators","authors":"Christoph Schumacher, Jan Weinstock, R. Leupers, G. Ascheid, L. Tosoratto, A. Lonardo, D. Petras, Thorsten Grötker","doi":"10.1109/IPDPSW.2013.34","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.34","url":null,"abstract":"Virtual prototyping of parallel and embedded systems increases insight into existing computer systems. It further allows properties of new systems to be explored already during their specification phase. Virtual prototypes of such systems benefit from parallel simulation techniques due to the increased simulation speed. One common problem full system simulator implementers face is the revision and integration of legacy models coded without thread-safety and deterministic behavior in mind. To lessen this burden, this paper presents a methodology to integrate unmodified SystemC legacy models into parallel SystemC simulators. Using the proposed technique, the embedded platform simulator of the EU FP7 project EURETILE, which inherited a number of legacy models from its predecessor project SHAPES, has been transformed into a parallel simulation platform, demonstrating speed-ups of up to 3.36 on four simulation host cores.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"506 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133053677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"A Comprehensive Analysis of XOR-Based Erasure Codes Tolerating 3 or More Concurrent Failures","authors":"P. Subedi, Xubin He","doi":"10.1109/IPDPSW.2013.155","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.155","url":null,"abstract":"In large-scale database storage systems, RAID systems have gained popularity due to their capability to tolerate multiple failures. As data loss is not an option in such systems, recent studies focus on erasure codes for RAID systems that can tolerate three or more concurrent failures. This paper surveys recent XOR-based erasure codes and presents a comprehensive analysis and comparison among these codes. Moreover, some erasure codes that focus on clustered erasures are also discussed. The paper concludes with open research issues.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"26 8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133482628","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
{"title":"Cooperative MIMO Paradigms for Cognitive Radio Networks","authors":"Wei Chen, Liang Hong","doi":"10.1109/IPDPSW.2013.9","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.9","url":null,"abstract":"This paper investigates the benefits that cooperation brings to cognitive radio networks. We focus on cooperative Multiple Input Multiple Output (MIMO) technology, where multiple distributed secondary users cooperate on data transmission and reception. Energy efficient cooperative MIMO paradigms are proposed to maximize the diversity gain and significantly improve the performance of both overlay and underlay systems. In the proposed overlay system, the secondary users can assist (relay) the primary transmissions even when they are far away from the primary users. In the proposed underlay system, the secondary users can share the primary users' frequency resources without any knowledge of the primary users' signals while meeting the strict interference constraint that the transmitted spectral density of the secondary users falls below the noise floor at the primary receivers. Numerical and experimental results are provided in order to discuss the advantages and limits of the proposed paradigms.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133556677","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}