2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum: Latest Publications

Scheduling a Parallel Sparse Direct Solver to Multiple GPUs
Kyungjoo Kim, V. Eijkhout
{"title":"Scheduling a Parallel Sparse Direct Solver to Multiple GPUs","authors":"Kyungjoo Kim, V. Eijkhout","doi":"10.1109/IPDPSW.2013.26","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.26","url":null,"abstract":"We present a sparse direct solver using multi-level task scheduling on a modern heterogeneous compute node consisting of a multi-core host processor and multiple GPU accelerators. Our direct solver is based on the multifrontal method, which is characterized by exploiting dense sub problems (fronts) related in an assembly tree. Critical to high performance of the solver is dynamic task allocation to account for the asymmetric performance of heterogeneous devices. Device-specific tasks are generated and adapted to different devices on the course of multifrontal factorization using multi-level matrix partitioning. Large blocks are used to provide coarse grain tasks for fast devices, and some of the blocks are recursively partitioned to supply fine-grained tasks for the next available (slower) devices. Experimental results are obtained from particular problems arising from a high order Finite Element Method.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"344 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132094138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
SAGE: Geo-Distributed Streaming Data Analysis in Clouds
R. Tudoran, Gabriel Antoniu, L. Bougé
{"title":"SAGE: Geo-Distributed Streaming Data Analysis in Clouds","authors":"R. Tudoran, Gabriel Antoniu, L. Bougé","doi":"10.1109/IPDPSW.2013.95","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.95","url":null,"abstract":"The continuous growth of sensor networks, stock exchanges, climate monitoring or scientific applications produces new streaming data at increasing rates. Managing and processing such data, sometimes generated from multiple geographical locations, raises important challenges as it requires real-time processing or data aggregation. Conventional solutions like DBMS, MapReduce or dedicated solutions adopting single-located environments fail to meet the demands required for processing the Geo-distributed streaming data. Public clouds like Azure, with data centers spread around the globe, offer the infrastructure which can handle such a processing. Our approach, proposes a service-oriented cloud architecture for performing the stream analysis, by composing services which are distributed among multiple cloud data centers. Hence, the computation is moved towards the multiple data sources exploiting the geographical data locality. The initial results showed good scalability of the approach, reaching 1000 cores in the Azure cloud, and performance improvements compared to single location processing of a factor of 3.3.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"59 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130833248","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Exploiting Content Similarity to Improve Memory Performance in Large-Scale High-Performance Computing Systems
Scott Levy
{"title":"Exploiting Content Similarity to Improve Memory Performance in Large-Scale High-Performance Computing Systems","authors":"Scott Levy","doi":"10.1109/IPDPSW.2013.75","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.75","url":null,"abstract":"As we consider building the next generation of extreme-scale systems, many of the biggest challenges are related to memory characteristics. In particular, overcoming challenges related to resilience and memory bandwidth will require innovative strategies for improving the performance of main memory. In this paper, we propose to exploit memory content similarity to improve memory performance. We begin by presenting several novel strategies that leverage memory content similarity to improve system resilience and effective memory bandwidth. Additionally, we seek to understand the source of similarity in the memory of HPC applications.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126577591","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Predict-More Router: A Low Latency NoC Router with More Route Predictions
Yuan He, Hiroshi Sasaki, Shinobu Miwa, Hiroshi Nakamura
{"title":"Predict-More Router: A Low Latency NoC Router with More Route Predictions","authors":"Yuan He, Hiroshi Sasaki, Shinobu Miwa, Hiroshi Nakamura","doi":"10.1109/IPDPSW.2013.40","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.40","url":null,"abstract":"Network-on-Chip (NoC) is a critical part of the memory hierarchy of emerging multicores. Lowering its communication latency while preserving its bandwidth is key to achieving high system performance. By now, one of the most effective methods helps achieving this goal is prediction router (PR). PR works by predicting the route an incoming packet may be transferred to and it speculatively allocates resources (virtual channels and the switch crossbar) to the packet and traverses the packet's flits using this predicted route in a single cycle without waiting for route computation; however, if prediction misses, the packet will then be processed in the conventional pipeline (in our work, four cycles) and the speculatively allocated router resources will be wasted. Obviously, prediction accuracy contributes to the amount of successful predictions, latency reduction and bandwidth consumption. We find that predictions hit around 65% for most applications even under the best algorithm so in such cases PR can at most accelerate about 65% of the packets while the left 35% will consume extra router resources and bandwidth. In order to increase the prediction accuracy, we propose a technique, which makes use of multiple prediction algorithms at the same time for one incoming packet. Such a prediction is more accurate. With this proposal, we design and implement predict-more router (PmR). While effectively increasing the prediction accuracy, PmR also helps utilizing remaining bandwidth within the router more productively. 
When both PmR and PR are evaluated under their best algorithm(s), we find that PmR is over 15% higher in prediction accuracy than PR, which helps PmR outperform PR by 3.5% on average in speeding-up the system. We also find that although PmR creates more contentions in prediction, these contentions can be well resolved and are kept within the router so both router internal bandwidth and link bandwidth are not exacerbated with it.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"94 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125712229","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Energy-Efficient Sparse Matrix Autotuning with CSX -- A Trade-off Study
J. Meyer, J. M. Cebrian, L. Natvig, V. Karakasis, D. Siakavaras, K. Nikas
{"title":"Energy-Efficient Sparse Matrix Autotuning with CSX -- A Trade-off Study","authors":"J. Meyer, J. M. Cebrian, L. Natvig, V. Karakasis, D. Siakavaras, K. Nikas","doi":"10.1109/IPDPSW.2013.219","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.219","url":null,"abstract":"In this paper, we apply a method for extracting a running power estimate of applications from hardware performance counters, producing power/time curves which can be integrated over particular intervals to estimate the energy consumption of individual application stages. We use this method to instrument executions of a conjugate gradient solver, to examine the energy and performance impacts of applying the Compressed Sparse eXtended (CSX) and classic Compressed Sparse Row (CSR) matrix compression methods to sparse linear systems from different application areas. The CSX format requires a preprocessing stage which identifies and exploits a range of matrix substructures, incurring a one-time cost which can facilitate more effective sparse matrix-vector multiplication (SpMV). As this numerical kernel is the primary performance bottleneck of conjugate gradient solvers, we take the approach of isolating the energy cost of preprocessing from a short sample of application iterations, obtaining measurements which enlighten the choice of which compression scheme is more appropriate to the input data. We examine the impact variable degrees of parallelism, processor clock frequency, and Hyper threading have on this trade-off. Our results include comparisons of empirically obtained results from all combinations of up to 8 threads on 4 hyper threaded cores, 3 clock frequencies, and 5 sample application matrices. We assess program-hardware interactions with views to structural properties of the data and hardware architectural features, and evaluate the approach with respect to integrating the energy instrumentation with present automatic performance tuning. 
Results show that our method is sufficiently precise to identify non-trivial tradeoffs in the parameter space, and may become suitable for a run-time automatic tuning scheme by applying a faster preprocessing mode of CSX.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126618360","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
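The abstract above describes integrating power/time curves over particular intervals to estimate the energy of individual application stages. A minimal sketch of that integration step follows, assuming power samples are already available as (time in seconds, power in watts) pairs. The function name `energy_joules` and the sample layout are hypothetical, and the paper's actual power model, which derives the estimate from hardware performance counters, is not reproduced here.

```python
def energy_joules(samples, start, end):
    """Estimate energy (J) used between `start` and `end` seconds.

    `samples` is a list of (time_s, power_w) pairs sorted by time.
    Power is assumed to vary linearly between neighbouring samples,
    so each segment is integrated with the trapezoidal rule.
    """
    total = 0.0
    for (t0, p0), (t1, p1) in zip(samples, samples[1:]):
        # Clip the segment to the requested interval.
        lo, hi = max(t0, start), min(t1, end)
        if hi <= lo:
            continue  # segment lies outside the interval (or has zero width)

        def power_at(t):
            # Linear interpolation of power inside this segment.
            return p0 + (p1 - p0) * (t - t0) / (t1 - t0)

        total += 0.5 * (power_at(lo) + power_at(hi)) * (hi - lo)
    return total


# Hypothetical trace: 10 W for one second, a ramp to 20 W, then 20 W.
trace = [(0.0, 10.0), (1.0, 10.0), (2.0, 20.0), (3.0, 20.0)]
whole_run = energy_joules(trace, 0.0, 3.0)
one_stage = energy_joules(trace, 1.5, 2.5)
```

Integrating over a sub-interval, as in `one_stage`, is what allows the cost of a one-time stage such as preprocessing to be separated from the cost of the iterations that follow it.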
Energy Consumption Models and Predictions for Large-Scale Systems
T. Samak, C. Morin, D. Bailey
{"title":"Energy Consumption Models and Predictions for Large-Scale Systems","authors":"T. Samak, C. Morin, D. Bailey","doi":"10.1109/IPDPSW.2013.228","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.228","url":null,"abstract":"Responsible, efficient and well-planned power consumption is becoming a necessity for monetary returns and scalability of computing infrastructures. While there are numerous sources from which power data can be obtained, analyzing this data is an intrinsically hard task. In this paper, we propose a data analysis pipeline that can handle the large-scale collection of energy consumption logs, apply sophisticated modeling to enable accurate prediction, and evaluate the efficiency of the analysis approach. We present the analysis of a power consumption data set collected over a 6-month period from two clusters of the Grid'5000 experimentation platform used in production. To solve the large data challenge, we used Hadoop with Pig data processing to generate a summary of the data that provides basic statistical aggregations, over different time scales. The aggregate data is then analyzed as a time series using sophisticated modeling methods with R statistical software. 
Energy models from such large dataset can help in understanding the evolution of consumption patterns, predicting future energy trends, and providing basis for generalizing the energy models to similar large-scale systems.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"318 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116635811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
Application Explorations for Future Interconnects
R. Barrett, C. Vaughan, S. Hammond, D. Roweth
{"title":"Application Explorations for Future Interconnects","authors":"R. Barrett, C. Vaughan, S. Hammond, D. Roweth","doi":"10.1109/IPDPSW.2013.60","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.60","url":null,"abstract":"For over two decades the dominant means for enabling portable performance of computational science and engineering applications on parallel processing architectures has been the bulk-synchronous parallel programming model. Code developers, motivated by performance considerations to minimize the number of messages transmitted, have typically strived to increase the size of each message through aggregation strategies. Emerging and future architectures, especially those seen as targeting Exascale capabilities, provide motivation and capabilities for revisiting this approach. In this paper we explore alternative configurations within the context of a large-scale complex multi-physics application and a proxy that represents its behavior, presenting results that demonstrate some important advantages as the number of processors increases in scale.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122512672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Fault Localization in NoCs Exploiting Periodic Heartbeat Messages in a Many-Core Environment
Arne Garbade, Sebastian Weis, Sebastian Schlingmann, Bernhard Fechner, T. Ungerer
{"title":"Fault Localization in NoCs Exploiting Periodic Heartbeat Messages in a Many-Core Environment","authors":"Arne Garbade, Sebastian Weis, Sebastian Schlingmann, Bernhard Fechner, T. Ungerer","doi":"10.1109/IPDPSW.2013.150","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.150","url":null,"abstract":"This paper presents a novel fault localization approach for NoCs by leveraging so called timed heartbeat messages. While these messages are periodically sent to report health states of processor cores to a fault detection unit, information about the network health state (topology) can be extracted from their timing behavior. We show how this health state information can be easily extracted from the message arrival times and give an estimation of the expected costs for this technique.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122844666","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
On Analyzing Large Graphs Using GPUs
A. Chatterjee, S. Radhakrishnan, J. Antonio
{"title":"On Analyzing Large Graphs Using GPUs","authors":"A. Chatterjee, S. Radhakrishnan, J. Antonio","doi":"10.1109/IPDPSW.2013.235","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.235","url":null,"abstract":"Studying properties of graphs is essential to various applications, and recent growth of online social networks has spurred interests in analyzing their structures using Graphical Processing Units (GPUs). Utilizing the faster available shared memory on GPUs have provided tremendous speed-up for solving many general-purpose problems. However, when data required for processing is large and needs to be stored in the global memory instead of the shared memory, simultaneous memory accesses by threads in execution becomes the bottleneck for achieving higher throughput. In this paper, for storing large graphs, we propose and evaluate techniques to efficiently utilize the different levels of the memory hierarchy of GPUs, with the focus being on the larger global memory. Given a graph G = (V, E), we provide an algorithm to count the number of triangles in G, while storing the adjacency information on the global memory. Our computation techniques and data structure for retrieving the adjacency information is derived from processing the breadth-first-search tree of the input graph. Also, techniques to generate combinations of nodes for testing the properties of graphs induced by the same are discussed in detail. Our methods can be extended to solve other combinatorial counting problems on graphs, such as finding the number of connected sub graphs of size k, number of cliques (resp. independent sets) of size k, and related problems for large data sets. In the context of the triangle counting algorithm, we analyze and utilize primitives such as memory access coalescing and avoiding partition camping that offset the increase in access latency of using a slower but larger global memory. 
Our experimental results for the GPU implementation show at least 10 times speedup for triangle counting over the CPU counterpart. Another 6 - 8% increase in performance is obtained by utilizing the above mentioned primitives as compared to the naïve implementation of the program on the GPU.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123072099","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 14
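The abstract above centres on counting the triangles of a graph G = (V, E) from its adjacency information. For orientation, here is a small CPU-only sketch of the counting problem itself. It is not the authors' GPU algorithm: it uses a plain neighbour-set intersection rather than the breadth-first-search-tree data structure the abstract mentions, and the function name `count_triangles` and the edge-list input format are assumptions made for illustration.

```python
def count_triangles(num_nodes, edges):
    """Count triangles in an undirected graph given as an edge list.

    Nodes are assumed to be numbered 0 .. num_nodes - 1.
    """
    # For each node keep only its higher-numbered neighbours, so a
    # triangle u < v < w is found exactly once, from its smallest node.
    higher = [set() for _ in range(num_nodes)]
    for u, v in edges:
        if u == v:
            continue  # self-loops cannot be part of a triangle
        a, b = (u, v) if u < v else (v, u)
        higher[a].add(b)

    total = 0
    for u in range(num_nodes):
        for v in higher[u]:
            # Any w adjacent to both u and v, with w > v, closes a triangle.
            total += len(higher[u] & higher[v])
    return total


# The complete graph on four nodes contains four triangles.
k4_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
triangles_in_k4 = count_triangles(4, k4_edges)
```

Ordering the nodes avoids counting each triangle several times. A GPU version would additionally have to lay the adjacency data out in global memory so that accesses coalesce, which is the concern the abstract raises.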
Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster
I. Yamazaki, Tingxing Dong, S. Tomov, J. Dongarra
{"title":"Tridiagonalization of a Symmetric Dense Matrix on a GPU Cluster","authors":"I. Yamazaki, Tingxing Dong, S. Tomov, J. Dongarra","doi":"10.1109/IPDPSW.2013.265","DOIUrl":"https://doi.org/10.1109/IPDPSW.2013.265","url":null,"abstract":"Symmetric dense Eigen value problems arise in many scientific and engineering simulations. In this paper, we use GPUs to accelerate its main computational kernel, the tridiagonalization of a dense symmetric matrix on a distributed multicore architecture. We then study the performance of this hybrid message-passing/shared-memory/GPU-computing paradigm on up to 16 compute nodes, each of which consists of 16 Intel Sandy Bridge processors and three NVIDIA GPUs. These studies show that such a hybrid paradigm can exploit the underlying hardware architecture and obtain significant speedups over a flat message-passing paradigm can, and they demonstrate a potential of efficiently solving large-scale Eigen value problems on a GPU cluster. Furthermore, these studies may provide insights on the general effects of such hybrid paradigms on emerging high-performance computers.","PeriodicalId":234552,"journal":{"name":"2013 IEEE International Symposium on Parallel & Distributed Processing, Workshops and Phd Forum","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2013-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121759861","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4