2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum: Latest Articles (Page 5)

Scheduling a Parallel Sparse Direct Solver to Multiple GPUs

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.26

Kyungjoo Kim, V. Eijkhout

Citations: 11

SAGE: Geo-Distributed Streaming Data Analysis in Clouds

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.95

R. Tudoran, Gabriel Antoniu, L. Bougé

Citations: 12

Exploiting Content Similarity to Improve Memory Performance in Large-Scale High-Performance Computing Systems

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.75

Scott Levy

Citations: 0

Predict-More Router: A Low Latency NoC Router with More Route Predictions

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.40

Yuan He, Hiroshi Sasaki, Shinobu Miwa, Hiroshi Nakamura

Abstract: Network-on-Chip (NoC) is a critical part of the memory hierarchy of emerging multicores. Lowering its communication latency while preserving its bandwidth is key to achieving high system performance. By now, one of the most effective methods helps achieving this goal is prediction router (PR). PR works by predicting the route an incoming packet may be transferred to and it speculatively allocates resources (virtual channels and the switch crossbar) to the packet and traverses the packet's flits using this predicted route in a single cycle without waiting for route computation; however, if prediction misses, the packet will then be processed in the conventional pipeline (in our work, four cycles) and the speculatively allocated router resources will be wasted. Obviously, prediction accuracy contributes to the amount of successful predictions, latency reduction and bandwidth consumption. We find that predictions hit around 65% for most applications even under the best algorithm so in such cases PR can at most accelerate about 65% of the packets while the left 35% will consume extra router resources and bandwidth. In order to increase the prediction accuracy, we propose a technique, which makes use of multiple prediction algorithms at the same time for one incoming packet. Such a prediction is more accurate. With this proposal, we design and implement predict-more router (PmR). While effectively increasing the prediction accuracy, PmR also helps utilizing remaining bandwidth within the router more productively. When both PmR and PR are evaluated under their best algorithm(s), we find that PmR is over 15% higher in prediction accuracy than PR, which helps PmR outperform PR by 3.5% on average in speeding-up the system. We also find that although PmR creates more contentions in prediction, these contentions can be well resolved and are kept within the router so both router internal bandwidth and link bandwidth are not exacerbated with it.
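The idea of applying several prediction algorithms to one incoming packet can be illustrated with a small sketch. It is not the authors' router design: the two predictors ("last port" and "always east"), the port labels, and the trace are hypothetical, and resource contention inside the router is not modeled. The sketch only shows why a packet that counts as a hit when any predictor is right yields a hit rate at least as high as each predictor alone.

```python
def hit_rate(predictions, actual):
    """Fraction of packets whose single predicted output port was correct."""
    hits = sum(p == a for p, a in zip(predictions, actual))
    return hits / len(actual)


def combined_hit_rate(prediction_lists, actual):
    """Fraction of packets for which at least one predictor was correct.

    prediction_lists holds one list of predicted ports per predictor,
    each aligned with the list of actual ports.
    """
    hits = sum(
        any(preds[i] == port for preds in prediction_lists)
        for i, port in enumerate(actual)
    )
    return hits / len(actual)


# Hypothetical trace of output ports actually taken by eight packets.
actual = ["E", "E", "N", "S", "E", "W", "N", "N"]
# Hypothetical predictor 1: repeat the previous packet's port (first guess "E").
last_port = ["E"] + actual[:-1]
# Hypothetical predictor 2: always predict the east port.
always_east = ["E"] * len(actual)

print(hit_rate(last_port, actual))                          # 0.375
print(hit_rate(always_east, actual))                        # 0.375
print(combined_hit_rate([last_port, always_east], actual))  # 0.5
```

In this toy trace each predictor alone is right for three of eight packets, while the pair together covers four of eight.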

Citations: 6

Energy-Efficient Sparse Matrix Autotuning with CSX -- A Trade-off Study

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.219

J. Meyer, J. M. Cebrian, L. Natvig, V. Karakasis, D. Siakavaras, K. Nikas

Abstract: In this paper, we apply a method for extracting a running power estimate of applications from hardware performance counters, producing power/time curves which can be integrated over particular intervals to estimate the energy consumption of individual application stages. We use this method to instrument executions of a conjugate gradient solver, to examine the energy and performance impacts of applying the Compressed Sparse eXtended (CSX) and classic Compressed Sparse Row (CSR) matrix compression methods to sparse linear systems from different application areas. The CSX format requires a preprocessing stage which identifies and exploits a range of matrix substructures, incurring a one-time cost which can facilitate more effective sparse matrix-vector multiplication (SpMV). As this numerical kernel is the primary performance bottleneck of conjugate gradient solvers, we take the approach of isolating the energy cost of preprocessing from a short sample of application iterations, obtaining measurements which enlighten the choice of which compression scheme is more appropriate to the input data. We examine the impact variable degrees of parallelism, processor clock frequency, and Hyperthreading have on this trade-off. Our results include comparisons of empirically obtained results from all combinations of up to 8 threads on 4 hyperthreaded cores, 3 clock frequencies, and 5 sample application matrices. We assess program-hardware interactions with views to structural properties of the data and hardware architectural features, and evaluate the approach with respect to integrating the energy instrumentation with present automatic performance tuning. Results show that our method is sufficiently precise to identify non-trivial tradeoffs in the parameter space, and may become suitable for a run-time automatic tuning scheme by applying a faster preprocessing mode of CSX.
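Integrating a power/time curve over an interval to obtain the energy of one application stage can be sketched as follows. This is a minimal illustration, not the instrumentation described in the abstract: it assumes the power estimate is already available as samples at strictly increasing timestamps, uses the trapezoidal rule with linear interpolation at the interval boundaries, and the function names and sample values are hypothetical.

```python
from bisect import bisect_left


def interpolate(times, power, t):
    """Linearly interpolate the power (watts) at time t (seconds).

    times must be strictly increasing; values outside the sampled
    range are clamped to the first or last sample.
    """
    if t <= times[0]:
        return power[0]
    if t >= times[-1]:
        return power[-1]
    i = bisect_left(times, t)
    t0, t1 = times[i - 1], times[i]
    p0, p1 = power[i - 1], power[i]
    return p0 + (p1 - p0) * (t - t0) / (t1 - t0)


def energy_joules(times, power, t_start, t_end):
    """Trapezoidal integral of the power curve over [t_start, t_end]."""
    # Interior samples plus interpolated values at both interval ends.
    ts = [t_start] + [t for t in times if t_start < t < t_end] + [t_end]
    ps = [interpolate(times, power, t) for t in ts]
    return sum(
        (ts[k + 1] - ts[k]) * (ps[k] + ps[k + 1]) / 2
        for k in range(len(ts) - 1)
    )


# Hypothetical samples: a preprocessing stage (0-2 s), then solver iterations (2-4 s).
times = [0.0, 1.0, 2.0, 3.0, 4.0]
power = [10.0, 10.0, 20.0, 20.0, 10.0]
print(energy_joules(times, power, 0.0, 2.0))  # 25.0
print(energy_joules(times, power, 2.0, 4.0))  # 35.0
```

Splitting the integral at the stage boundary is what lets a one-time preprocessing cost be compared against the per-iteration cost of the solver.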

Citations: 3

Energy Consumption Models and Predictions for Large-Scale Systems

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.228

T. Samak, C. Morin, D. Bailey

Abstract: Responsible, efficient and well-planned power consumption is becoming a necessity for monetary returns and scalability of computing infrastructures. While there are numerous sources from which power data can be obtained, analyzing this data is an intrinsically hard task. In this paper, we propose a data analysis pipeline that can handle the large-scale collection of energy consumption logs, apply sophisticated modeling to enable accurate prediction, and evaluate the efficiency of the analysis approach. We present the analysis of a power consumption data set collected over a 6-month period from two clusters of the Grid'5000 experimentation platform used in production. To solve the large data challenge, we used Hadoop with Pig data processing to generate a summary of the data that provides basic statistical aggregations, over different time scales. The aggregate data is then analyzed as a time series using sophisticated modeling methods with R statistical software. Energy models from such large dataset can help in understanding the evolution of consumption patterns, predicting future energy trends, and providing basis for generalizing the energy models to similar large-scale systems.

Citations: 14

Application Explorations for Future Interconnects

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.60

R. Barrett, C. Vaughan, S. Hammond, D. Roweth

Citations: 2

Fault Localization in NoCs Exploiting Periodic Heartbeat Messages in a Many-Core Environment

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.150

Arne Garbade, Sebastian Weis, Sebastian Schlingmann, Bernhard Fechner, T. Ungerer

Citations: 5

On Analyzing Large Graphs Using GPUs

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.235

A. Chatterjee, S. Radhakrishnan, J. Antonio

Abstract: Studying properties of graphs is essential to various applications, and recent growth of online social networks has spurred interests in analyzing their structures using Graphical Processing Units (GPUs). Utilizing the faster available shared memory on GPUs have provided tremendous speed-up for solving many general-purpose problems. However, when data required for processing is large and needs to be stored in the global memory instead of the shared memory, simultaneous memory accesses by threads in execution becomes the bottleneck for achieving higher throughput. In this paper, for storing large graphs, we propose and evaluate techniques to efficiently utilize the different levels of the memory hierarchy of GPUs, with the focus being on the larger global memory. Given a graph G = (V, E), we provide an algorithm to count the number of triangles in G, while storing the adjacency information on the global memory. Our computation techniques and data structure for retrieving the adjacency information is derived from processing the breadth-first-search tree of the input graph. Also, techniques to generate combinations of nodes for testing the properties of graphs induced by the same are discussed in detail. Our methods can be extended to solve other combinatorial counting problems on graphs, such as finding the number of connected subgraphs of size k, number of cliques (resp. independent sets) of size k, and related problems for large data sets. In the context of the triangle counting algorithm, we analyze and utilize primitives such as memory access coalescing and avoiding partition camping that offset the increase in access latency of using a slower but larger global memory. Our experimental results for the GPU implementation show at least 10 times speedup for triangle counting over the CPU counterpart. Another 6 - 8% increase in performance is obtained by utilizing the above mentioned primitives as compared to the naïve implementation of the program on the GPU.
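For readers unfamiliar with the triangle counting problem on a graph G = (V, E), a sequential CPU baseline can be sketched as below. This is not the BFS-tree-based GPU algorithm from the abstract: it swaps in a common adjacency-intersection technique, assumes nodes are numbered 0 to n-1, and the function name is hypothetical. Each undirected edge is stored once, oriented from the lower to the higher node id, so every triangle is counted exactly once.

```python
def count_triangles(num_nodes, edges):
    """Count triangles in an undirected graph given as (u, v) pairs.

    Self-loops are ignored and duplicate edges collapse, because
    neighbors are kept in sets.
    """
    # higher[u] holds the neighbors of u whose id is greater than u.
    higher = [set() for _ in range(num_nodes)]
    for u, v in edges:
        if u != v:
            higher[min(u, v)].add(max(u, v))

    total = 0
    for u in range(num_nodes):
        for v in higher[u]:
            # A common higher neighbor w closes the triangle u < v < w.
            total += len(higher[u] & higher[v])
    return total


# The complete graph on four nodes contains four triangles.
k4 = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
print(count_triangles(4, k4))  # 4
```

A baseline like this is useful mainly as a correctness reference when validating a parallel implementation on small inputs.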

Citations: 14

Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster

2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum Pub Date : 2013-05-20 DOI: 10.1109/IPDPSW.2013.265

I. Yamazaki, Tingxing Dong, S. Tomov, J. Dongarra

Citations: 4